
🧠 Gemini Multimodal Embedding – Examples

Practical, runnable examples for Gemini Embedding 2 (gemini-embedding-2-preview) — the first fully multimodal embedding model that maps text, images, video, audio and PDFs into the same vector space.


✨ What's Inside

| Script | Description |
| --- | --- |
| `01_text_embedding.py` | Single & batch text embedding, task types, 768D normalization |
| `02_video_embedding.py` | Inline upload, Files API, video+text cross-modal, long-video chunking |
| `03_multimodal_embedding.py` | Cross-modal search: find videos with a text query |
| `04_search.py` | Semantic search over saved embeddings |
| `05_describe.py` | Analyze a saved video embedding against 100 predefined topics to describe what the video is about |

🚀 Quick Start

1. Clone & install dependencies

```bash
git clone https://github.com/YOUR_USERNAME/gemini-multimodal-embedding-examples.git
cd gemini-multimodal-embedding-examples

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

2. Set up your API key

```bash
cp .env.example .env
# Open .env and paste your Google AI API key
```

Get a free API key at → aistudio.google.com/app/apikey

3. (Optional) Get a sample video

```bash
curl -L "https://www.w3schools.com/html/mov_bbb.mp4" -o sample.mp4
```

📖 Usage

Text embedding

```bash
python 01_text_embedding.py
```

Demonstrates:

  • Single text → 3072-dimensional vector
  • Batch embedding (multiple texts in one call)
  • Task-type-aware embeddings (RETRIEVAL_QUERY vs RETRIEVAL_DOCUMENT)
  • Output dimensionality reduction (3072 → 768) with L2 normalization
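The last bullet hides a subtle step: truncating a 3,072D embedding to its first 768 values breaks the unit-length property, so the vector must be re-normalized before cosine similarity is meaningful. A minimal NumPy sketch (the random vector here is stand-in data, not real model output):

```python
import numpy as np

def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm

# Stand-in for a 3072-dimensional embedding returned by the model.
full = np.random.default_rng(0).normal(size=3072)

# Truncate to the first 768 dimensions, then re-normalize:
reduced = l2_normalize(full[:768])

print(round(float(np.linalg.norm(reduced)), 6))  # → 1.0
```

After this step, the dot product of two reduced vectors is directly their cosine similarity.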

Video embedding

```bash
python 02_video_embedding.py sample.mp4
```

Three different strategies:

  1. Inline – reads the file as bytes (best for small videos < 10 MB)
  2. Files API – uploads the video first (recommended for large files)
  3. Video + Text – combines video bytes with a textual description into a single embedding

Embeddings are saved as JSON in the embeddings/ folder for later reuse.
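The save-for-reuse step needs no special tooling; a plain JSON file holding the source name and the vector is enough. A sketch of that pattern (the exact schema used by the repo's scripts may differ, and the vector below is a hypothetical stand-in):

```python
import json
from pathlib import Path

def save_embedding(vector: list[float], source: str, out_dir: str = "embeddings") -> Path:
    """Persist an embedding vector as JSON so later scripts can reuse it."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"{Path(source).stem}_embedding.json"
    path.write_text(json.dumps({"source": source, "vector": vector}))
    return path

def load_embedding(path: Path) -> list[float]:
    """Read back a vector saved by save_embedding."""
    return json.loads(path.read_text())["vector"]

# Hypothetical vector standing in for a real video embedding:
p = save_embedding([0.1, 0.2, 0.3], "sample.mp4")
print(load_embedding(p))  # → [0.1, 0.2, 0.3]
```

Storing the source filename alongside the vector lets the search scripts report *which* video matched, not just a score.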


Cross-modal search

```bash
python 03_multimodal_embedding.py
```

Embeds multiple videos and a text query, then ranks the videos by cosine similarity to the query — no separate index needed.
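Because all modalities share one vector space, the ranking step reduces to cosine similarity between the query vector and each video vector. A sketch with toy 3-D vectors standing in for real embeddings (real ones have thousands of dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query: np.ndarray, candidates: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Return (name, score) pairs, best match first."""
    scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy vectors standing in for real embeddings:
query = np.array([1.0, 0.0, 0.0])
videos = {
    "cooking.mp4": np.array([0.9, 0.1, 0.0]),
    "football.mp4": np.array([0.0, 1.0, 0.2]),
}
print(rank_by_similarity(query, videos)[0][0])  # → cooking.mp4
```

For a handful of videos this brute-force scan is all you need; a vector database only pays off at much larger scale.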


Semantic search over saved embeddings

```bash
python 04_search.py
```

Loads previously saved embedding JSON files and performs nearest-neighbour search against a text query.


Describe a video

```bash
python 05_describe.py embeddings/<your_video_embedding>.json
python 05_describe.py embeddings/<your_video_embedding>.json --top 10
```

Compares the video embedding against 100 predefined topic phrases across categories like food, sports, nature, technology, and emotions. Outputs a ranked list with a visual score bar.

```text
=================================================================
  This video is most likely about (Top 8):
=================================================================
  #1   72.3%  [██████████████████░░░░░░░]  eating food, eating at a restaurant
  #2   68.1%  [█████████████████░░░░░░░░]  eating fast food
  ...
=================================================================
```
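The bar display above is a few lines of string formatting once the topic scores exist. A sketch with hypothetical scores (the actual script's column widths and rounding may differ):

```python
def score_bar(score: float, width: int = 25) -> str:
    """Render a similarity score in [0, 1] as a fixed-width bar."""
    filled = round(score * width)
    return "[" + "█" * filled + "░" * (width - filled) + "]"

def format_ranking(scores: dict[str, float], top: int = 8) -> list[str]:
    """Sort topics by score descending and format the top entries."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return [
        f"#{i}   {s * 100:.1f}%  {score_bar(s)}  {topic}"
        for i, (topic, s) in enumerate(ranked, start=1)
    ]

# Hypothetical topic scores (the real ones come from cosine similarity):
for line in format_ranking({"eating at a restaurant": 0.723,
                            "eating fast food": 0.681,
                            "playing football": 0.12}):
    print(line)
```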

🔬 Model Reference

| Property | Value |
| --- | --- |
| Model ID | `gemini-embedding-2-preview` |
| Supported inputs | Text, Image (PNG/JPEG), Video (MP4/MOV), Audio (MP3/WAV), PDF |
| Max input | 8,192 tokens |
| Default output dims | 3,072 |
| Configurable dims | 128 – 3,072 |
| Recommended dims | 768, 1,536, 3,072 |

Task Types

| Task Type | Use When |
| --- | --- |
| `RETRIEVAL_QUERY` | Embedding a search query |
| `RETRIEVAL_DOCUMENT` | Embedding documents to be indexed |
| `SEMANTIC_SIMILARITY` | Comparing two pieces of content |
| `CLASSIFICATION` | Sentiment analysis, topic classification |
| `CLUSTERING` | Grouping similar content |
| `QUESTION_ANSWERING` | The question side of a QA system |
| `CODE_RETRIEVAL_QUERY` | Code search queries |

⚠️ Important Notes

  1. Normalization – The default 3,072D output is already normalized. If you reduce to 768D or 1,536D, apply L2 normalization manually.
  2. Video limit – Each embedding call accepts at most 128 seconds of video. Use the chunking helper in 02_video_embedding.py for longer videos.
  3. Model incompatibility – Embeddings from gemini-embedding-001 and gemini-embedding-2-preview live in different vector spaces. If you upgrade, you must re-embed all your data.
  4. Cost tip – The Batch API offers up to 50% discount if latency is not critical.
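The 128-second limit in note 2 only requires splitting the timeline into consecutive spans before embedding each one. A minimal sketch of such a helper (the actual chunking code in `02_video_embedding.py` may differ):

```python
def chunk_spans(duration_s: float, max_chunk_s: float = 128.0) -> list[tuple[float, float]]:
    """Split a video duration into consecutive (start, end) spans,
    each at most max_chunk_s seconds long."""
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans

# A 5-minute video needs three calls:
print(chunk_spans(300))  # → [(0.0, 128.0), (128.0, 256.0), (256.0, 300.0)]
```

Each span is embedded separately; the per-chunk vectors can then be searched individually or averaged (and re-normalized) into one whole-video vector.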

📁 Project Structure

```text
.
├── 01_text_embedding.py       # Text embedding examples
├── 02_video_embedding.py      # Video embedding (inline / Files API / cross-modal)
├── 03_multimodal_embedding.py # Cross-modal text → video search
├── 04_search.py               # Semantic search over saved embeddings
├── 05_describe.py             # Video content analysis via topic matching
├── embeddings/                # Saved embedding JSON files (git-ignored)
├── requirements.txt
├── .env.example
└── README.md
```

🤝 Contributing

Pull requests are welcome! Feel free to:

  • Add more modalities (images, audio, PDFs)
  • Improve the topic list in 05_describe.py
  • Add vector database integration examples (Pinecone, Qdrant, pgvector…)

📄 License

MIT
