
PRE-MERGER SNAPSHOT This article has been subsumed into Vector RAG: Inside the Dense-Vector Retrieval Stack, which combines this piece with two siblings (vector-database internals and chunking strategies) and adds a Part 4 on how the three layers cascade. The original remains here as a pre-merger snapshot. For the canonical Week 5 reading, follow the link above.

The Embedding Model Landscape

Vector similarity is only as good as the model that produces the vectors. Most RAG failures trace back to this first decision, and most teams make it without enough information. Selection, evaluation, and fine-tuning, with the tradeoffs each step assumes.

If you have read the earlier articles in this series, you understand the geometry. Words become vectors. Similar meanings cluster. Cosine similarity measures the angle between them. The distributional hypothesis, first articulated by Firth in 1957, underpins the entire edifice: words that appear in similar contexts develop similar representations.

That is the theory. The practice looks different.

When you sit down to build a retrieval-augmented generation system, the first concrete decision you face is which embedding model to use. Not which architecture is theoretically elegant. Not which paper introduced the most novel training objective. Which model, right now, will turn your documents and queries into vectors that actually retrieve the right passages.

This decision is consequential. A retrieval system that returns irrelevant passages forces the language model to hallucinate or hedge, regardless of how capable that model is. The embedding model is the foundation. Everything downstream depends on it.

The Current Landscape

The embedding model ecosystem has expanded dramatically since 2022. Where once you had Word2Vec and maybe Sentence-BERT, you now face a crowded field of commercial APIs, open-source models, and specialized variants. Understanding the major players is the first step toward an informed choice.

Commercial APIs

OpenAI text-embedding-3 ships in two variants: text-embedding-3-small (1536 dimensions, cheaper) and text-embedding-3-large (3072 dimensions, more capable). Both support Matryoshka representation learning, meaning you can truncate the output to fewer dimensions (256, 512, 1024) with graceful quality degradation rather than catastrophic collapse. This is a practical feature: you can tune the cost-quality tradeoff after model selection, without retraining anything.

Cohere embed-v3 introduced explicit input type parameters: search_document, search_query, classification, and clustering. The model adjusts its internal behavior based on which type you specify. This is not cosmetic. A query like "What causes memory leaks in Python?" and a passage explaining garbage collection serve different retrieval roles; encoding that asymmetry into the model improves recall. Cohere also supports 1024 dimensions by default and offers compression to binary or integer embeddings for storage efficiency.

Google's Gecko (part of the Vertex AI family) and various models available through Amazon Bedrock round out the commercial options. Each API has its own pricing model, rate limits, and dimension choices.

Open-Source Models

The open-source landscape is where the real action has been. Several families of models have emerged, each with distinct training strategies.

BGE (BAAI General Embedding) from the Beijing Academy of Artificial Intelligence uses a multi-stage training pipeline: pre-training on large-scale unsupervised data, then fine-tuning with contrastive learning on curated pairs (Xiao et al., 2023). The bge-large-en-v1.5 model at 1024 dimensions has been a workhorse for production systems. The newer bge-m3 supports multi-lingual, multi-granularity, and multi-functionality embedding in a single model.

E5 (EmbEddings from bidirEctional Encoder rEpresentations) from Microsoft Research popularized role-prefixed embeddings. The key insight: prepending a short marker such as "query: " or "passage: " to the input tells a single model which side of a retrieval pair it is encoding. e5-large-v2 at 1024 dimensions established the recipe (Wang et al., 2022), and the newer e5-mistral-7b-instruct (which uses a decoder architecture and accepts natural-language task instructions) pushed the boundaries of what open models could achieve.

GTE (General Text Embeddings) from Alibaba DAMO Academy follows a similar multi-stage recipe and has performed competitively on benchmarks, particularly in multilingual settings.

nomic-embed-text from Nomic AI deserves attention for its emphasis on reproducibility and openness. The training data, code, and model weights are all publicly available. At 768 dimensions with a context length of 8192 tokens, it occupies an interesting middle ground between smaller sentence transformers and the larger instruction-tuned models.

Here is a rough landscape view:

Model                      | Dims   | Max Tokens | Type
---------------------------|--------|------------|-------------
text-embedding-3-small     | 1536   | 8191       | Commercial
text-embedding-3-large     | 3072   | 8191       | Commercial
Cohere embed-v3            | 1024   | 512        | Commercial
bge-large-en-v1.5          | 1024   | 512        | Open
bge-m3                     | 1024   | 8192       | Open
e5-large-v2                | 1024   | 512        | Open
e5-mistral-7b-instruct     | 4096   | 32768      | Open
gte-large-en-v1.5          | 1024   | 8192       | Open
nomic-embed-text-v1.5      | 768    | 8192       | Open

The field moves fast. By the time you read this, new entries will have appeared on the MTEB leaderboard. The specific rankings matter less than understanding what differentiates these models and how to evaluate them for your particular use case.

What the Benchmarks Actually Measure

The Massive Text Embedding Benchmark (MTEB), introduced by Muennighoff et al. (2022), was a landmark contribution. Before MTEB, comparing embedding models meant cherry-picking from inconsistent evaluation setups. MTEB standardized evaluation across eight task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. It covers 58 datasets spanning 112 languages.

The MTEB leaderboard became the de facto scoreboard for the field, with model authors optimizing for it and practitioners citing its rankings as the comparison of record. It is genuinely useful.

It is also insufficient for RAG evaluation, in ways that matter.

First, MTEB retrieval tasks draw on established IR collections from the BEIR suite, such as MS MARCO and Natural Questions. These are general-domain datasets with relatively clean, well-formed queries. Your production queries will not look like this. Users misspell terms, use domain jargon, ask ambiguous questions, and provide fragments rather than complete sentences. A model that excels on "What is the capital of France?" may struggle with "cap france" or "that city where the Eiffel Tower is."

Second, MTEB measures retrieval quality in isolation. In a RAG system, retrieval is the first stage of a pipeline. What matters is whether the retrieved passages contain information the language model can use to generate a correct answer. A passage might score high on relevance metrics but contain information in a format the LLM cannot easily extract. This interaction effect is invisible to embedding-only benchmarks.

Third, domain specificity. MTEB's datasets skew toward general knowledge, Wikipedia-style text, and web content. If you are building a RAG system for legal documents, medical records, or semiconductor datasheets, the benchmark scores may not predict your system's performance at all. Domain-specific vocabulary, document structure, and query patterns can dramatically shift the relative ranking of models.

MTEB benchmarks the gray column; production RAG lives in the blue one.

The practical implication: use MTEB as a starting shortlist, not a final answer. Pick the top five or six models from the leaderboard, then evaluate them on your actual data with your actual queries. The model that ranks first on MTEB may rank third on your domain. That is not a failure of benchmarking; it is a reminder that benchmarks measure what they measure.

The Dimension Question

Embedding dimensionality is one of those parameters that seems purely technical until you encounter its practical consequences. The number of dimensions in your embedding vectors affects three things simultaneously: retrieval quality, storage costs, and latency.

Quality vs. Dimensions

More dimensions give the model more room to encode fine-grained semantic distinctions. In a 384-dimensional space, the model must compress all the nuance of language into 384 numbers per text. In a 3072-dimensional space, it has eight times the capacity.

But the relationship between dimensions and quality is not linear. The first few hundred dimensions carry the bulk of the semantic information. Subsequent dimensions encode increasingly subtle distinctions. Going from 384 to 768 dimensions typically produces a measurable improvement in retrieval quality. Going from 1536 to 3072 produces a smaller improvement. Going from 3072 to 6144 would produce a negligible one for most tasks.

This is why Matryoshka representation learning (used in OpenAI's text-embedding-3 models) works. The training procedure encourages the model to front-load important information into the first dimensions. You can truncate a 3072-dimensional vector to 1024 dimensions and retain most of the retrieval quality, because the model learned to put the most discriminative features first.
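On the client side, truncation amounts to very little code. The following is a minimal sketch, assuming you already hold a full-length vector from a model trained with a Matryoshka objective; the function name `truncate_embedding` and the toy vector are illustrative and not part of any library. It keeps the leading components and re-normalizes, so that a dot product between truncated vectors is still a cosine similarity.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], dims: int) -> List[float]:
    """Keep the first `dims` components and L2-normalize the result."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # degenerate all-zero prefix; nothing to normalize
    return [x / norm for x in head]


# Toy 8-dimensional stand-in for a real 3072-dimensional embedding.
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]
short = truncate_embedding(full, 4)
print(len(short), short)
```

Applying the same truncation to a model that was not trained this way gives no such guarantee, because nothing encouraged that model to front-load information into the leading dimensions.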

Storage and Cost

Each dimension is typically stored as a 32-bit float, so the storage cost scales linearly with dimension count. At ten million vectors, the choice of dimensionality compounds quickly.

Doubling dimensions doubles storage. At small scale this rarely matters: ten thousand float32 vectors take about 15 MB at 384 dimensions and about 123 MB at 3072. At ten million vectors, the same choice is the difference between roughly 15 GB and roughly 123 GB of raw vectors, that is, between an index that fits in RAM on a single machine and one that may require sharding across distributed infrastructure.
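The arithmetic is easy to check for your own corpus size. This is a back-of-the-envelope sketch that counts raw float32 vectors only; the helper name `index_size_bytes` is hypothetical, and real vector databases add index structures and metadata on top of these figures.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value


for dims in (384, 768, 1536, 3072):
    small = index_size_bytes(10_000, dims) / 1e6       # megabytes
    large = index_size_bytes(10_000_000, dims) / 1e9   # gigabytes
    print(f"{dims:>5} dims: {small:7.1f} MB at 10k vectors, "
          f"{large:7.1f} GB at 10M vectors")
```

Passing `bytes_per_value=1` gives the corresponding estimate for int8-quantized vectors.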

Latency

Vector similarity search scales linearly with dimension count for brute-force comparisons. Approximate nearest neighbor (ANN) algorithms reduce this dependency, but higher dimensions still impose a cost. The distance calculation itself takes longer. The index structures consume more memory. Quantization (reducing precision from float32 to int8 or binary) can offset this, at the cost of some retrieval accuracy.

For most production RAG systems, 768 or 1024 dimensions represent the practical sweet spot. High enough to capture meaningful semantic distinctions. Low enough to keep storage and latency manageable. If you need more and can afford the infrastructure, 1536 dimensions offer diminishing but real improvements. Beyond that, you are in specialist territory.

When General-Purpose Models Are Enough

Here is the claim that will save you weeks of engineering time: for most RAG applications, a general-purpose embedding model is sufficient.

This is counterintuitive. If you are building a system for medical literature retrieval, surely you need a medical embedding model? If your documents contain legal contracts, should you not fine-tune on legal text?

Often, no. And the reason goes back to how these models are trained.

Modern embedding models like E5, BGE, and text-embedding-3 are trained on extraordinarily diverse corpora. They have seen medical papers, legal briefs, technical documentation, and financial reports during training. The vocabulary and patterns of these domains are already encoded in their vector spaces, even if no domain-specific fine-tuning was performed.

General-purpose models tend to be sufficient when three conditions hold:

Your vocabulary overlaps substantially with standard English. If your documents use common words in their standard meanings, a general model already maps them to the right regions of the vector space. A medical article about "myocardial infarction" uses specialized vocabulary, but the surrounding text ("patients," "treatment," "risk factors") is thoroughly general.
Your queries are natural language. Users asking "What are the side effects of metformin?" are writing text that closely resembles the training data. The query-document matching problem is well within the model's learned capability.
Your quality bar is "good enough for LLM synthesis." RAG does not require perfect retrieval. It requires that the top-k retrieved passages contain enough relevant information for the language model to generate a useful answer. If three of your top five passages are relevant, the LLM can usually work with that.

I have seen teams spend months fine-tuning embedding models only to discover that the general-purpose baseline was within two percentage points of their custom model on end-to-end answer quality. The engineering time would have been better spent on chunking strategies, prompt engineering, or reranking.

Start with a general-purpose model. Measure end-to-end performance. Only invest in fine-tuning if the measurement shows a meaningful gap.

Fine-Tuning Embeddings

Sometimes the gap is real. When it is, fine-tuning the embedding model can produce dramatic improvements. Understanding when and how matters.

When Fine-Tuning Matters

Fine-tuning becomes necessary in specific, identifiable situations:

Domain-specific vocabulary with non-standard meanings. In semiconductor manufacturing, "wafer" does not mean a thin biscuit. In legal contracts, "consideration" has a precise technical meaning unrelated to thoughtfulness. When your domain repurposes common words, general models map them to the wrong region of vector space. Fine-tuning moves them to the right neighborhood.

Specialized abbreviations and jargon. If your corpus is full of terms like "CYP3A4," "10-K filing," or "ASIL-D compliance," the general model may not have seen enough examples during pre-training to produce meaningful embeddings. Fine-tuning teaches the model what these terms mean in your context.

Non-standard query patterns. If your users issue queries that look fundamentally different from web search queries (structured codes, part numbers, formulaic expressions), the model's learned query-document mapping may be miscalibrated.

When two-point accuracy improvements matter. In high-stakes applications (medical diagnosis support, legal discovery), even small improvements in retrieval precision can have significant downstream consequences. Fine-tuning is worth it when the cost of missed retrievals is high.

How Contrastive Learning Works

The dominant fine-tuning approach for embedding models is contrastive learning. The core idea is simple: teach the model which texts should be close together and which should be far apart.

You provide training examples as pairs or triplets:

# Positive pair: query + relevant passage
("What causes diabetic retinopathy?",
 "Chronic hyperglycemia damages retinal blood vessels...")

# Negative pair: query + irrelevant passage
("What causes diabetic retinopathy?",
 "The retina is a thin layer of tissue lining the back...")

The training objective pushes positive pairs closer together in vector space and negative pairs further apart. The loss function (typically InfoNCE or a variant) looks like this conceptually:

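import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(query, positive, negatives, temperature=0.05):
    """
    Push query toward positive, away from negatives.
    Temperature controls how sharply the model discriminates.
    """
    # Cosine similarities
    pos_sim = cosine_similarity(query, positive)
    neg_sims = [cosine_similarity(query, neg) for neg in negatives]

    # Softmax-style normalization
    numerator = math.exp(pos_sim / temperature)
    denominator = numerator + sum(math.exp(s / temperature) for s in neg_sims)

    return -math.log(numerator / denominator)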

The temperature parameter controls discrimination sharpness. Lower temperatures make the model more decisive: it must push positives very close and negatives very far. Higher temperatures are more forgiving.
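To make the effect of temperature concrete, here is a small numerical sketch. It assumes the cosine similarities have already been computed, and the values are invented for illustration: one positive, one hard negative that scores almost as high, and two easy negatives. The function name `info_nce` is an assumption of this sketch.

```python
import math
from typing import List


def info_nce(pos_sim: float, neg_sims: List[float], temperature: float) -> float:
    """InfoNCE loss for one query, given precomputed cosine similarities."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Illustrative similarities: one positive and three negatives, one of them hard.
pos_sim, neg_sims = 0.80, [0.75, 0.30, 0.10]
for t in (1.0, 0.1, 0.05, 0.01):
    print(f"temperature={t:<5} loss={info_nce(pos_sim, neg_sims, t):.4f}")

# If the hard negative outranks the positive, a low temperature punishes it hard.
print(f"misranked, t=0.01: {info_nce(0.75, [0.80], 0.01):.4f}")
print(f"misranked, t=1.0:  {info_nce(0.75, [0.80], 1.0):.4f}")
```

In this toy setup, lowering the temperature shrinks the contribution of the easy negatives until the loss is driven almost entirely by the hard one. That is one way to read "more decisive": small similarity gaps near the top of the ranking dominate the training signal.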

The Importance of Hard Negatives

Not all negative examples are equally useful. A negative passage about cooking recipes is trivially distinguishable from a query about diabetic retinopathy. The model learns nothing from these easy cases.

Hard negatives are passages that are superficially similar to the positive but actually irrelevant. For our diabetic retinopathy query, a hard negative might be a passage about retinal anatomy that does not discuss diabetes, or a passage about diabetes management that does not mention eye complications.

Mining hard negatives is itself a skill. Common approaches include:

BM25 negatives: Use lexical search to find passages that share keywords with the query but are not relevant. These are hard because they overlap in vocabulary.
Embedding negatives: Use the current model to find passages that are close in vector space but not relevant. These are the hardest negatives, the ones the model currently gets wrong.
Cross-encoder negatives: Use a more powerful cross-encoder to score candidates and select passages that the bi-encoder ranks highly but the cross-encoder ranks low.

The quality of your hard negatives often matters more than the quantity of your training data.
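All three mining approaches share the same final step: rank candidates with some scorer, then keep the top-ranked passages that are not labeled relevant. The sketch below shows only that step. The scores are made up, the passage ids are hypothetical, and `mine_hard_negatives` is an illustrative name; in practice the scores would come from BM25 or from the current embedding model.

```python
from typing import Dict, List, Set


def mine_hard_negatives(
    scores: Dict[str, float],
    relevant_ids: Set[str],
    num_negatives: int = 2,
) -> List[str]:
    """Return the highest-scoring passage ids that are NOT labeled relevant."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [pid for pid in ranked if pid not in relevant_ids][:num_negatives]


# Hypothetical retrieval scores for the diabetic retinopathy query.
scores = {
    "hyperglycemia_damage": 0.82,  # labeled relevant
    "retinal_anatomy": 0.79,       # scores high, but not relevant
    "diabetes_diet": 0.71,         # scores high, but not relevant
    "cooking_recipes": 0.05,       # easy negative, little training signal
}
print(mine_hard_negatives(scores, relevant_ids={"hyperglycemia_damage"}))
```

One caveat with this approach: a high-scoring passage that is not labeled relevant may simply be an unlabeled positive. Filtering candidates with a cross-encoder, as in the third approach above, is one way to reduce that risk.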

How Much Data Do You Need?

Less than you think, more than zero.

Sentence-BERT (Reimers & Gurevych, 2019) demonstrated that fine-tuning with as few as 1,000 labeled pairs could produce meaningful improvements on domain-specific tasks. Recent work suggests that 5,000 to 10,000 high-quality query-passage pairs is a reasonable target for most domains.

The key word is "high-quality." Ten thousand pairs where annotators carefully verified relevance will outperform one hundred thousand pairs generated by heuristic matching. And 5,000 pairs with well-mined hard negatives will outperform 10,000 pairs with random negatives.

Here is a practical recipe for generating training data:

"""
Generate training pairs for embedding fine-tuning.

Strategy:
1. Use an LLM to generate synthetic queries for your passages
2. Use BM25 to mine hard negatives
3. Filter with a cross-encoder for quality
"""
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.losses import MultipleNegativesRankingLoss
from torch.utils.data import DataLoader

# Step 1: Generate synthetic queries from your passages
# (llm_client is a placeholder for whatever LLM SDK you use)
def generate_query(passage, llm_client):
    prompt = f"""Given the following passage, generate a natural
    question that this passage would answer.

    Passage: {passage}

    Question:"""
    return llm_client.complete(prompt)

# Step 2: Create training examples (corpus is your list of passages)
train_examples = []
for passage in corpus:
    query = generate_query(passage, llm_client)
    train_examples.append(
        InputExample(texts=[query, passage])
    )

# Step 3: Fine-tune with contrastive loss
# MultipleNegativesRankingLoss treats the other passages in each batch
# as negatives, so larger batches give more negatives per query.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)

The synthetic query generation step is where LLMs have transformed the fine-tuning workflow. Instead of hiring annotators to write queries, you can use GPT-4 or Claude to generate plausible queries for each passage in your corpus. This is not a substitute for real user queries, but it is a remarkably effective bootstrap.

Domain Adaptation Without Fine-Tuning

Fine-tuning is effective but requires infrastructure, training data, and iteration cycles. Several techniques achieve partial domain adaptation without modifying model weights at all.

Instruction-Prefixed Models

Models like E5 and BGE support instruction prefixes that guide embedding behavior. Instead of passing raw text, you prepend a task description:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# For queries: prepend "query: "
query_embedding = model.encode(
    "query: What are the side effects of metformin?"
)

# For passages: prepend "passage: "
passage_embedding = model.encode(
    "passage: Common side effects of metformin include "
    "gastrointestinal symptoms such as nausea, diarrhea, "
    "and abdominal discomfort..."
)

This prefix mechanism works because the model saw the same prefixes during training, so they act as a switch at inference time. The "query:" prefix tells the model to produce an embedding suited to matching against passages. The "passage:" prefix tells it to produce an embedding suited to being found by queries. These are different targets, and separating them improves retrieval. Omitting the prefixes with a model that was trained on them typically degrades results.

BGE models use a slightly different convention:

# BGE uses an instruction prefix for queries only
query_embedding = model.encode(
    "Represent this sentence for searching relevant passages: "
    "What are the side effects of metformin?"
)

# Passages are encoded without a prefix
passage_embedding = model.encode(
    "Common side effects of metformin include "
    "gastrointestinal symptoms..."
)

Query-Passage Asymmetry

The asymmetry between queries and passages is more than a prefix trick. It reflects a real structural difference in how queries and documents function in retrieval.

Queries are typically short, specific, and express an information need. "What causes memory leaks in Python?" is 6 words. The relevant passage might be 200 words explaining reference counting, circular references, and the gc module. The embedding model must map these two very different texts to nearby points in vector space.

This is hard. A model trained to put semantically similar texts close together will naturally embed the query near other short questions about Python, not near long explanatory passages about memory management. Query-passage asymmetry addresses this by allowing the model to produce different embedding distributions for queries and documents.

Cohere's input_type parameter makes this explicit:

import cohere

co = cohere.Client("your-api-key")

# Embed the query with search_query type
query_response = co.embed(
    texts=["What causes memory leaks in Python?"],
    model="embed-english-v3.0",
    input_type="search_query"
)

# Embed documents with search_document type
doc_response = co.embed(
    texts=[
        "Memory leaks in Python typically occur when objects "
        "maintain references that prevent garbage collection...",
        "The gc module provides an interface to the garbage "
        "collector, allowing you to inspect reference cycles..."
    ],
    model="embed-english-v3.0",
    input_type="search_document"
)

If you are using a model that supports these asymmetric modes and you are not using them, you are leaving retrieval quality on the table. This is one of the most common and most easily fixed mistakes in RAG implementations.

Prompt Engineering for Embeddings

A less discussed technique is adjusting the text you feed to the embedding model. This is not about changing the model; it is about changing the input.

Consider a chunk from a legal document:

Original chunk:
"Section 4.2(b). Notwithstanding any provision herein to the
contrary, the Indemnifying Party shall not be liable for any
Losses to the extent arising from the Indemnified Party's
gross negligence or willful misconduct."

A user might query: "Who is responsible if someone is grossly negligent?" The embedding model must bridge from casual language to legal prose. You can help by enriching the chunk before embedding:

Enriched chunk:
"Indemnification limitation for negligence. This section
addresses liability exceptions. Section 4.2(b). Notwithstanding
any provision herein to the contrary, the Indemnifying Party
shall not be liable for any Losses to the extent arising from
the Indemnified Party's gross negligence or willful misconduct."

The prepended summary in plain language creates additional semantic hooks for the embedding model. This technique, sometimes called "contextual chunk enrichment," bridges the vocabulary gap without any model modification.
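Mechanically, enrichment is string concatenation applied before the embedding call. The helper below is a minimal sketch: `enrich_chunk` is a hypothetical name, and the summary is assumed to come from wherever you generate it (an LLM, a section header, hand-written metadata).

```python
from typing import Optional


def enrich_chunk(chunk: str, summary: str, title: Optional[str] = None) -> str:
    """Prepend a plain-language summary (and optional title) to a chunk."""
    parts = [p.strip() for p in (title, summary) if p and p.strip()]
    parts.append(chunk.strip())
    return " ".join(parts)


original = (
    "Section 4.2(b). Notwithstanding any provision herein to the contrary, "
    "the Indemnifying Party shall not be liable for any Losses to the extent "
    "arising from the Indemnified Party's gross negligence or willful misconduct."
)
enriched = enrich_chunk(
    original,
    summary="Indemnification limitation for negligence. "
            "This section addresses liability exceptions.",
)
print(enriched)
```

A reasonable design choice, though not the only one, is to embed the enriched text while storing the original chunk alongside it, so the language model later receives the unmodified source wording.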

Putting It All Together: A Decision Framework

Given the complexity of the landscape, here is a practical framework for embedding model selection in RAG systems. It is deliberately opinionated.

An opinionated path: a strong default, an honest baseline, and a hierarchy of cheap fixes before any fine-tuning investment.

Step 1: Start with a Strong Default

Pick one of these and build your entire pipeline around it:

If you want a commercial API: OpenAI text-embedding-3-small at 1536 dimensions. Good quality, reasonable cost, and the Matryoshka property lets you reduce dimensions later if needed.
If you want open-source: BAAI/bge-large-en-v1.5 or intfloat/e5-large-v2 at 1024 dimensions. Proven, well-documented, and you control the infrastructure.
If you need multilingual: BAAI/bge-m3 or Cohere embed-v3 with multilingual support.

Step 2: Build Your Evaluation Set

Before optimizing anything, create a test set of 50 to 100 query-passage pairs that represent your actual use case. These should include:

Real queries from users (or realistic synthetic ones)
The passages you expect the system to retrieve
A few adversarial cases where similar-looking passages are not relevant

Measure retrieval quality with Recall@k (what fraction of relevant passages appear in the top k results) and NDCG@k (which also considers ranking position). If you do not have an evaluation set, you do not have a basis for any optimization decision.
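Both metrics fit in a few lines. The sketch below assumes binary relevance labels (a passage is either relevant or not) and string passage ids; the function names and the example ranking are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: rewards relevant passages ranked near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["p3", "p7", "p1", "p9", "p2"]  # system output, best first
relevant = {"p1", "p2"}                  # ground-truth labels

print(f"Recall@3 = {recall_at_k(ranked, relevant, 3):.3f}")
print(f"Recall@5 = {recall_at_k(ranked, relevant, 5):.3f}")
print(f"NDCG@5   = {ndcg_at_k(ranked, relevant, 5):.3f}")
```

In this example Recall@5 is perfect because both relevant passages are retrieved, while NDCG@5 is well below 1 because they sit at positions three and five. The two metrics answer different questions, which is why it helps to track both.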

Step 3: Measure the Baseline

Run your evaluation set through the default model. Record the metrics. This is your baseline.

"""
Minimal RAG evaluation with retrieval metrics.
"""
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def evaluate_retrieval(queries, relevant_passages, corpus, k=5):
    """
    Compute Recall@k for a set of query-passage pairs.
    """
    # Encode corpus once; normalize so the dot product is cosine similarity
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

    recalls = []
    for query, expected_ids in zip(queries, relevant_passages):
        query_embedding = model.encode(query, normalize_embeddings=True)

        # Compute similarities
        similarities = np.dot(corpus_embeddings, query_embedding)
        top_k_ids = np.argsort(similarities)[-k:][::-1]

        # Check recall
        hits = len(set(top_k_ids) & set(expected_ids))
        recall = hits / len(expected_ids)
        recalls.append(recall)

    return np.mean(recalls)

# Example usage
recall = evaluate_retrieval(
    queries=test_queries,
    relevant_passages=test_relevant_ids,
    corpus=document_chunks,
    k=5
)
print(f"Recall@5: {recall:.3f}")

Step 4: Try Low-Hanging Fruit First

Before fine-tuning, exhaust these cheaper optimizations:

Use query-passage prefixes if your model supports them. This alone can improve recall by 5-15%.
Adjust chunking strategy. Chunks that are too large dilute the embedding. Chunks that are too small lose context. Experiment with 256, 512, and 1024 token chunks with appropriate overlap.
Add a reranker. A cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-12-v2) that rescores the top 20-50 results from your bi-encoder can dramatically improve precision. This is often more impactful than changing the embedding model.
Enrich chunks with contextual metadata. Prepend document titles, section headers, or LLM-generated summaries to your chunks before embedding.
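For the chunking experiment, a sliding window with overlap is the usual starting point. The sketch below splits on whitespace as a stand-in for a real tokenizer, which is an assumption: for accurate token counts you would use the tokenizer that matches your embedding model. The function name `chunk_tokens` is illustrative.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
for chunk in chunk_tokens(text.split(), chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Re-running the evaluation set from Step 2 for each chunk size keeps the comparison honest: the right size is whichever one scores best on your queries, not a universal constant.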

Step 5: Fine-Tune Only If Justified

If you have exhausted the above and your retrieval quality still falls short, fine-tuning is the next step. Follow the contrastive learning recipe described earlier. Start with synthetic queries generated by an LLM, mine hard negatives from your existing retrieval failures, and iterate.

Expect the process to take one to two weeks of focused engineering effort for a first iteration. Budget for three to five iterations before the model stabilizes.

Step 6: Monitor in Production

Embedding quality degrades over time as your corpus evolves and user query patterns shift. Build logging that captures retrieval results alongside user feedback signals (clicks, thumbs up/down, follow-up queries). Use this data to detect drift and to generate new fine-tuning examples for future iterations.

Drift in a production retrieval store does not announce itself with an error. The same query continues to return ten results, the API keeps responding, and only the relative ordering of those results shifts as the corpus and the model behind it evolve. The companion demo simulates that scenario on a small synthetic corpus and exposes four detectors that surface different mechanisms of change: centroid distance and spatial KL respond when the population shifts, score-distribution PSI responds when the embedding function itself changes, and recall@K compared against a reference query suite responds to either. The demo walks through corpus growth, a model version push, and a Drift-Adapter mitigation in turn, so the detector signatures can be read against each cause.
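Two of those detectors are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the demo's implementation: centroid distance is taken as the cosine distance between mean embedding vectors, and PSI (population stability index) is computed over equal-width bins of similarity scores. The function names, bin count, and toy data are all assumptions.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a set of embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two score samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def histogram(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)  # clamp outliers
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]  # avoid log(0)

    ref, cur = histogram(reference), histogram(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Toy data: top-1 similarity scores before and after a model change.
ref_scores = [0.50 + 0.01 * i for i in range(40)]
new_scores = [s - 0.15 for s in ref_scores]
print(f"PSI, unchanged scores: {psi(ref_scores, ref_scores):.3f}")
print(f"PSI, shifted scores:   {psi(ref_scores, new_scores):.3f}")

# Toy data: the corpus centroid moves as new content arrives.
old_corpus = [[1.0, 0.0], [0.9, 0.1]]
new_corpus = [[0.0, 1.0], [0.1, 0.9]]
print(f"Centroid distance: {cosine_distance(centroid(old_corpus), centroid(new_corpus)):.3f}")
```

Alert thresholds for either signal need to be calibrated against your own traffic; the numbers printed here only show the direction of change.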


A Complete Working Example

Let us put the pieces together with a complete, runnable example that demonstrates embedding model usage in a minimal RAG pipeline:

"""
Minimal RAG pipeline demonstrating embedding model selection
and query-passage asymmetry.

Requirements:
    pip install sentence-transformers numpy
"""
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Tuple


def build_index(
    chunks: List[str],
    model_name: str = "BAAI/bge-large-en-v1.5",
    prefix: str = ""
) -> Tuple[SentenceTransformer, np.ndarray]:
    """
    Encode document chunks into an embedding matrix.

    Args:
        chunks: List of text passages to index
        model_name: HuggingFace model identifier
        prefix: Optional prefix for passage encoding

    Returns:
        Tuple of (model, embedding_matrix)
    """
    model = SentenceTransformer(model_name)

    # Apply prefix if specified (e.g., for BGE models)
    texts = [prefix + chunk for chunk in chunks]

    embeddings = model.encode(
        texts,
        normalize_embeddings=True,  # L2 normalize for cosine sim
        show_progress_bar=True,
        batch_size=32
    )

    return model, np.array(embeddings)


def retrieve(
    query: str,
    model: SentenceTransformer,
    index: np.ndarray,
    chunks: List[str],
    k: int = 5,
    query_prefix: str = ""
) -> List[Tuple[str, float]]:
    """
    Retrieve top-k passages for a query.

    Returns list of (passage, similarity_score) tuples.
    """
    query_embedding = model.encode(
        query_prefix + query,
        normalize_embeddings=True
    )

    # Cosine similarity (embeddings are normalized)
    similarities = np.dot(index, query_embedding)

    top_k = np.argsort(similarities)[-k:][::-1]

    return [
        (chunks[i], float(similarities[i]))
        for i in top_k
    ]


# ---- Example usage ----

# Sample corpus
chunks = [
    "Python's garbage collector uses reference counting as its "
    "primary mechanism. Each object maintains a count of "
    "references pointing to it. When the count drops to zero, "
    "the memory is immediately freed.",

    "Circular references occur when two or more objects "
    "reference each other, preventing their reference counts "
    "from reaching zero. Python's gc module detects and "
    "collects these cycles periodically.",

    "Memory leaks in Python often stem from unintentional "
    "references held in global variables, class attributes, "
    "or closures that capture large objects.",

    "The tracemalloc module, introduced in Python 3.4, "
    "provides detailed memory allocation traces that help "
    "developers identify the source of memory leaks.",

    "Java uses a mark-and-sweep garbage collector that "
    "periodically identifies and frees unreachable objects "
    "from the heap.",
]

# Build index with BGE prefix convention
query_prefix = (
    "Represent this sentence for searching relevant passages: "
)

model, index = build_index(chunks)

# Retrieve
results = retrieve(
    query="What causes memory leaks in Python?",
    model=model,
    index=index,
    chunks=chunks,
    k=3,
    query_prefix=query_prefix
)

for passage, score in results:
    print(f"[{score:.3f}] {passage[:80]}...")

This example is deliberately minimal. A production system would add persistent storage (a vector database like Pinecone, Weaviate, or pgvector), batch processing, caching, and error handling. But the core logic remains: encode passages, encode queries (with appropriate prefixes), compute similarity, return the top results.

Looking Forward

The embedding model landscape is evolving in several directions simultaneously.

Late interaction models like ColBERT store per-token embeddings rather than a single vector per passage. This preserves fine-grained matching information at the cost of larger index sizes. For domains where exact term matching matters (legal, medical), the tradeoff is often worthwhile.
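The scoring rule behind late interaction, usually called MaxSim, is compact: for each query token embedding, take its best match among the document's token embeddings, then sum those maxima. The sketch below uses tiny hand-made two-dimensional vectors in place of real token embeddings, and the function names are illustrative.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query tokens pointing in different directions.
query = [[1.0, 0.0], [0.0, 1.0]]

# doc_a contains an exact match for each query token; doc_b has a single
# "averaged" token that partially matches both.
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
doc_b = [[0.7, 0.7]]

print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

The comparison hints at why per-token storage can help with exact term matching: a document that matches each query token individually outscores one whose content has been blended into a single direction.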

Multimodal embeddings are beginning to unify text, images, and structured data into shared vector spaces. Models like CLIP demonstrated this for text-image pairs; newer work extends the principle to tables, code, and domain-specific formats.

Longer context windows in embedding models (8192 tokens and beyond) reduce the need for aggressive chunking. If you can embed an entire document section as a single vector, you avoid the information loss that comes with splitting text at arbitrary boundaries.

Binary and quantized embeddings reduce storage requirements by up to 32x relative to float32. Cohere's binary embeddings compress 1024 float32 dimensions (4,096 bytes) into 128 bytes, a 32x reduction. The retrieval quality loss is modest for most applications, especially when combined with a reranking stage.
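The 128-byte figure follows from storing one bit per dimension. As a minimal sketch, assuming sign thresholding at zero (a common binarization scheme, not necessarily the one any particular provider uses), the packing and the matching Hamming-distance comparison look like this. The function names are illustrative.

```python
from typing import List


def binarize(vector: List[float]) -> bytes:
    """Pack the sign of each dimension into one bit (1 if positive, else 0)."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits.to_bytes((len(vector) + 7) // 8, "big")


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits; smaller means more similar."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Toy 1024-dimensional vector with alternating signs.
vector = [0.1 if i % 2 == 0 else -0.1 for i in range(1024)]
packed = binarize(vector)

float32_bytes = len(vector) * 4
print(f"{float32_bytes} bytes as float32, {len(packed)} bytes packed, "
      f"{float32_bytes // len(packed)}x smaller")
print("distance to itself:", hamming_distance(packed, packed))
```

Because Hamming distance is coarse, a common pattern is to use the binary index for a fast first pass and then rescore the shortlist with full-precision vectors or a reranker.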

But the fundamental insight has not changed since Firth. Meaning lives in relationships. Embedding models encode those relationships as geometry. The quality of your RAG system depends on the fidelity of that encoding.

Choose the model carefully. Evaluate it honestly. And do not fine-tune until you have exhausted the simpler options.

. . .

References

Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). "MTEB: Massive Text Embedding Benchmark." arXiv preprint arXiv:2210.07316.
Neelakantan, A., Xu, T., Puri, R., Radford, A., Han, J. M., Tworek, J., et al. (2022). "Text and Code Embeddings by Contrastive Pre-Training." arXiv preprint arXiv:2201.10005.
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., et al. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv preprint arXiv:2212.03533.
Xiao, S., Liu, Z., Zhang, P., & Muennighoff, N. (2023). "C-Pack: Packaged Resources To Advance General Chinese Embedding." arXiv preprint arXiv:2309.07597.
Reimers, N. & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Proceedings of EMNLP-IJCNLP 2019.
Firth, J. R. (1957). "A Synopsis of Linguistic Theory, 1930-1955." Studies in Linguistic Analysis, Philological Society, Oxford.
Khattab, O. & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." Proceedings of SIGIR 2020.
Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., et al. (2022). "Matryoshka Representation Learning." Advances in Neural Information Processing Systems 35.
