Gemini Embedding 2: Putting Text, Images, Video, and Audio in One Vector Space

Google explains Gemini Embedding 2's multimodal embedding capabilities: unified handling of text, images, video, audio, and documents for RAG, visual search, reranking, and classification.

The Google Developers Blog explains how to build with Gemini Embedding 2. The model is now generally available through the Gemini API and Gemini Enterprise Agent Platform. The important point is not simply that it is a new embedding model, but that it maps text, images, video, audio, and documents into the same semantic space.

This broadens what retrieval systems can handle. Many RAG pipelines previously had to convert images, video, or audio into text or metadata before indexing them separately. Gemini Embedding 2 can process multimodal inputs directly, making it easier for agents, search systems, and classifiers to work with real business materials.

Original article: Building with Gemini Embedding 2: Agentic multimodal RAG and beyond

Model Capabilities

Gemini Embedding 2 supports more than 100 languages. A single request can process:

  • Up to 8,192 text tokens
  • Up to 6 images
  • Up to 120 seconds of video
  • Up to 180 seconds of audio
  • Up to 6 pages of PDF

Its key idea is a unified semantic space. Developers can place content from different modalities into one vector representation system, then use the same retrieval, clustering, or reranking logic to process it.

For example, a text description and an image can be included in the same embedding request:

from google import genai
from google.genai import types

client = genai.Client()

with open('dog.png', 'rb') as f:
    image_bytes = f.read()
result = client.models.embed_content(
    model='gemini-embedding-2',
    contents=[
        "An image of a dog",
        types.Part.from_bytes(
            data=image_bytes,
            mime_type='image/png',
        ),
    ]
)

print(result.embeddings)

If you want a separate embedding for each input rather than one aggregated vector, you can use the Batch API. The original article also notes that Agent Platform support for this kind of batch workflow is still in progress.

What It Means for RAG

Multimodal embeddings are useful for agentic RAG. An AI agent may need to inspect a code repository, PDFs, screenshots, charts, audio meeting notes, and product images at the same time. If all of these materials can enter the same semantic space, the retrieval pipeline no longer needs a separate entry point for every format.

Google recommends using task prefixes according to the goal of the task, so the embeddings better match the retrieval objective. For example, question answering, fact checking, code retrieval, and search results can use different prefixes:

# Generate an embedding for your task's query:
def prepare_query(content):
    return f"task: question answering | query: {content}"
    # return f"task: fact checking | query: {content}"
    # return f"task: code retrieval | query: {content}"
    # return f"task: search result | query: {content}"

# Generate an embedding for the document of an asymmetric retrieval task:
def prepare_document(content, title=None):
    if title is None:
        title = "none"
    return f"title: {title} | text: {content}"

This kind of prefix is suitable for asymmetric retrieval: user queries are often short, while documents are often long. Formatting query and document differently for the task can improve matching between short queries and long documents.
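To make the asymmetry concrete, here is a minimal sketch of how the two sides end up formatted differently before embedding (the helper names follow the snippet above; the example strings are illustrative, not from the original article):

```python
# Hypothetical helpers mirroring the prefix format shown above.
def prepare_query(content):
    return f"task: question answering | query: {content}"

def prepare_document(content, title=None):
    return f"title: {title or 'none'} | text: {content}"

# A short query and a longer document get different framing:
query_text = prepare_query("What is the notice period?")
doc_text = prepare_document(
    "The notice period shall be thirty (30) days from written notice.",
    title="Master Services Agreement",
)
```

Both strings are then embedded with the same `embed_content` call; only the prefix differs between the query side and the document side.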

The original article gives two real-world examples:

  • Harvey saw a 3% increase in Recall@20 on legal retrieval benchmarks compared with its previous embeddings.
  • Supermemory saw a 40% increase in Recall@1 search accuracy and uses it across memory, indexing, search, and Q&A pipelines.

These numbers do not mean every scenario will improve by the same amount, but they show that multimodal embeddings are already producing results in real retrieval products, not only demos.

Gemini Embedding 2 is also suitable for image-to-image search, image-text hybrid search, and product identification. The original article mentions Nuuly, URBN’s clothing rental company, using it to match photos of untagged garments in warehouses against its catalog. Match@20 improved from 60% to nearly 87%, and the overall successful identification rate rose from 74% to over 90%.

The point in this type of scenario is not content generation, but understanding which inventory item, document, or product record is closest to a given image. If your business has many images, video clips, or scanned documents, multimodal embeddings can be more natural than text-only indexing.

Search Reranking

Embeddings can also be used for reranking. A common approach is to first retrieve a set of candidate results, then calculate the similarity between each candidate and the user’s query, pushing more relevant content to the top:

import numpy as np

# 1. Define a similarity function (a dot product, which equals cosine
#    similarity when the embeddings are unit-normalized)
def dot_product(a: np.ndarray, b: np.ndarray):
    return np.array(a) @ np.array(b).T

# 2. Retrieve your embeddings
# (Assuming 'summaries' is your list of search results and 'get_embeddings'
#  wraps the embed_content call shown earlier)
search_res = get_embeddings(summaries)
embedded_query = get_embeddings([query])

# 3. Calculate similarity scores
sim_value = dot_product(search_res, embedded_query)

# 4. Select the most relevant result
best_match_index = np.argmax(sim_value)

The original article also mentions another idea: first ask the model to generate a baseline hypothetical answer from its internal knowledge, embed that answer, and compare it with candidate content to find the most semantically relevant result. This is especially useful for Q&A-style RAG.
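A minimal sketch of that hypothetical-answer approach follows; `generate_answer` and `get_embeddings` are assumed stand-ins for a Gemini generation call and the embedding helper used above, not names from the original article:

```python
import numpy as np

def rerank_with_hypothetical_answer(query, candidates,
                                    generate_answer, get_embeddings):
    # 1. Ask the model for a baseline answer from its internal knowledge.
    hypothetical = generate_answer(query)
    # 2. Embed the hypothetical answer instead of the raw query.
    answer_vec = np.array(get_embeddings([hypothetical])[0])
    cand_vecs = np.array(get_embeddings(candidates))
    # 3. Score each candidate by similarity to the hypothetical answer.
    scores = cand_vecs @ answer_vec
    # 4. Return candidates from most to least similar.
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order]
```

The intuition is that a generated answer, even if imperfect, is often closer in embedding space to the correct passage than the short question itself is.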

Clustering, Classification, and Anomaly Detection

Beyond retrieval, embeddings are also useful for clustering, classification, and anomaly detection. Unlike the asymmetric question-answering retrieval above, these are symmetric tasks, where the same task prefix can be used for both query and document:

# Generate an embedding for the query & document of your task.
def prepare_query_and_document(content):
    return f'task: clustering | query: {content}'
    # return f'task: sentence similarity | query: {content}'
    # return f'task: classification | query: {content}'

These tasks can be used for sentiment classification, content moderation, similar asset grouping, and anomaly discovery. They can also help agents organize large amounts of context before moving into later reasoning steps.
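As one small sketch of the anomaly-detection case: items whose embedding sits far from the group centroid can be flagged. This is plain NumPy over vectors that would come from embeddings with a symmetric prefix such as `task: clustering`; the 0.5 threshold is an arbitrary illustration, not a recommended value:

```python
import numpy as np

def find_anomalies(vectors, threshold=0.5):
    # Unit-normalize so dot products are cosine similarities.
    v = np.array(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    # Compare each item with the (normalized) centroid of the group.
    centroid = v.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    sims = v @ centroid
    # Items with low similarity to the centroid are candidate anomalies.
    return [i for i, s in enumerate(sims) if s < threshold]
```

In practice the threshold would be tuned on labeled examples, and the same normalized vectors can feed a standard clustering algorithm for the grouping use cases.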

Storage and Cost

Gemini Embedding 2 outputs 3,072-dimensional vectors by default. It uses Matryoshka Representation Learning, so vectors can be truncated to smaller dimensions with output_dimensionality. Google recommends 1,536 or 768 dimensions when efficiency is the priority:

1
2
3
4
5
result = client.models.embed_content(
    model="gemini-embedding-2",
    contents="What is the meaning of life?",
    config={"output_dimensionality": 768}
)
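One practical note when truncating: a sliced MRL vector is generally no longer unit-norm, so re-normalizing before dot-product similarity is a common precaution. A minimal sketch (the 768 default mirrors the recommendation above):

```python
import numpy as np

def truncate_and_normalize(vec, dims=768):
    # Keep the leading MRL dimensions, then restore unit norm so that
    # dot products remain comparable to cosine similarity.
    v = np.array(vec, dtype=float)[:dims]
    return v / np.linalg.norm(v)
```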

Vectors can be stored in Agent Platform Vector Search, Pinecone, Weaviate, Qdrant, ChromaDB, and similar systems. For cost, the original article notes that the Batch API provides higher throughput at 50% of the standard embedding price.

How Developers Can Use It

If you already have text-based RAG, you can start with two incremental upgrades:

  1. Put PDFs, screenshots, image descriptions, and text documents into the same index, then test whether retrieval recall becomes more stable.
  2. Add task prefixes for different tasks, such as question answering, fact checking, code retrieval, and product search. Do not process all content with the same embedding format.

If you are building a new product, consider these directions first:

  • Enterprise knowledge bases: retrieve documents, charts, presentation screenshots, and meeting materials together.
  • Visual search: use images, text, or mixed inputs to find products, assets, design drafts, and archives.
  • Agent toolchains: let coding agents, research agents, or customer support agents retrieve business materials in multiple formats.
  • Content governance: classify, cluster, and detect anomalies across text, images, and video clips.

The value of Gemini Embedding 2 is that it turns multimodal materials into one searchable asset system. For developers, this reduces the need for an intermediate “convert to text, then retrieve” layer and makes RAG systems closer to the shape of real-world data.
