Vector RAG
Embedding models, vector indexes, and chunking are three layers of one stack. The decisions in each layer cascade into the others, and the production system that gets vector retrieval right is the one that treats the three as a single design problem. This is the full walk through that stack: the embedding model and how it gets evaluated, what the vector database is doing underneath the three-line tutorial, and where chunking decisions silently cap the recall of everything downstream.
The structure of this article mirrors the structure of the retrieval system itself. Part 1 covers the embedding model, the function that turns text into points in a high-dimensional space. Part 2 covers the database that stores and searches those points. Part 3 covers chunking, the upstream decision that determines what text becomes a vector in the first place. Part 4 covers the cascade, and it is the reason three previously separate articles were merged into one: how a choice in any one layer changes what is possible in the other two, and why these three decisions are not independent in production.
The Embedding Layer
Vector similarity is only as good as the model that produces the vectors, and most dense-retrieval failures trace back to this first decision. This part walks the current landscape (OpenAI, Cohere, BGE, E5, MTEB), what the benchmarks actually measure, the dimension question, fine-tuning mechanics, and a six-step decision framework for picking an embedding model and knowing whether to fine-tune it.
If you have read the earlier articles in this series, you understand the geometry. Words become vectors. Similar meanings cluster. Cosine similarity measures the angle between them. The distributional hypothesis, first articulated by Firth in 1957, underpins the entire edifice: words that appear in similar contexts develop similar representations.
That is the theory. The practice looks different.
When you sit down to build a retrieval-augmented generation system, the first concrete decision you face is which embedding model to use. Not which architecture is theoretically elegant. Not which paper introduced the most novel training objective. Which model, right now, will turn your documents and queries into vectors that actually retrieve the right passages.
This decision is consequential. A retrieval system that returns irrelevant passages forces the language model to hallucinate or hedge, regardless of how capable that model is. The embedding model is the foundation. Everything downstream depends on it.
The Current Landscape
The embedding model ecosystem has expanded dramatically since 2022. Where once you had Word2Vec and maybe Sentence-BERT, you now face a crowded field of commercial APIs, open-source models, and specialized variants. Understanding the major players is the first step toward an informed choice.
Commercial APIs
OpenAI text-embedding-3 ships in two variants: text-embedding-3-small (1536 dimensions, cheaper) and text-embedding-3-large (3072 dimensions, more capable). Both support Matryoshka representation learning, meaning you can truncate the output to fewer dimensions (256, 512, 1024) with graceful quality degradation rather than catastrophic collapse. This is a practical feature: you can tune the cost-quality tradeoff after model selection, without retraining anything.
Cohere embed-v3 introduced explicit input type parameters: search_document, search_query, classification, and clustering. The model adjusts its internal behavior based on which type you specify. This is not cosmetic. A query like "What causes memory leaks in Python?" and a passage explaining garbage collection serve different retrieval roles; encoding that asymmetry into the model improves recall. Cohere's embed-v3 produces 1024-dimensional vectors by default and offers compression to binary or integer embeddings for storage efficiency.
Google's Gecko (part of the Vertex AI family) and various models available through Amazon Bedrock round out the commercial options. Each API has its own pricing model, rate limits, and dimension choices.
Open-Source Models
The open-source landscape is where the real action has been. Several families of models have emerged, each with distinct training strategies.
BGE (BAAI General Embedding) from the Beijing Academy of Artificial Intelligence uses a multi-stage training pipeline: pre-training on large-scale unsupervised data, then fine-tuning with contrastive learning on curated pairs. The bge-large-en-v1.5 model at 1024 dimensions has been a workhorse for production systems. The newer bge-m3 supports multi-lingual, multi-granularity, and multi-functionality embedding in a single model (Xiao et al., 2023).
E5 (EmbEddings from bidirEctional Encoder rEpresentations) from Microsoft Research introduced instruction-tuned embeddings. The key insight: prepending a task description to the input text lets a single model handle retrieval, classification, and clustering differently. e5-large-v2 at 1024 dimensions and the newer e5-mistral-7b-instruct (which uses a decoder architecture for embeddings) pushed the boundaries of what open models could achieve (Wang et al., 2022).
GTE (General Text Embeddings) from Alibaba DAMO Academy follows a similar multi-stage recipe and has performed competitively on benchmarks, particularly in multilingual settings.
nomic-embed-text from Nomic AI deserves attention for its emphasis on reproducibility and openness. The training data, code, and model weights are all publicly available. At 768 dimensions with a context length of 8192 tokens, it occupies an interesting middle ground between smaller sentence transformers and the larger instruction-tuned models.
Here is a rough landscape view:
Model                     | Dims | Max Tokens | Type
--------------------------|------|------------|-----------
text-embedding-3-small    | 1536 | 8191       | Commercial
text-embedding-3-large    | 3072 | 8191       | Commercial
Cohere embed-v3           | 1024 | 512        | Commercial
bge-large-en-v1.5         | 1024 | 512        | Open
bge-m3                    | 1024 | 8192       | Open
e5-large-v2               | 1024 | 512        | Open
e5-mistral-7b-instruct    | 4096 | 32768      | Open
gte-large-en-v1.5         | 1024 | 8192       | Open
nomic-embed-text-v1.5     | 768  | 8192       | Open
The field moves fast. By the time you read this, new entries will have appeared on the MTEB leaderboard. The specific rankings matter less than understanding what differentiates these models and how to evaluate them for your particular use case.
What the Benchmarks Actually Measure
The Massive Text Embedding Benchmark (MTEB), introduced by Muennighoff et al. (2022), was a landmark contribution. Before MTEB, comparing embedding models meant cherry-picking from inconsistent evaluation setups. MTEB standardized evaluation across eight task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. It covers 58 datasets spanning 112 languages.
The MTEB leaderboard became the de facto scoreboard for the field, with model authors optimizing for it and practitioners citing its rankings as the comparison of record. It is genuinely useful.
It is also insufficient for RAG evaluation, in ways that matter.
First, MTEB retrieval tasks use established IR benchmarks like MS MARCO, Natural Questions, and BEIR. These are general-domain datasets with relatively clean, well-formed queries. Your production queries will not look like this. Users misspell terms, use domain jargon, ask ambiguous questions, and provide fragments rather than complete sentences. A model that excels on "What is the capital of France?" may struggle with "cap france" or "that city where the Eiffel Tower is."
Second, MTEB measures retrieval quality in isolation. In a RAG system, retrieval is the first stage of a pipeline. What matters is whether the retrieved passages contain information the language model can use to generate a correct answer. A passage might score high on relevance metrics but contain information in a format the LLM cannot easily extract. This interaction effect is invisible to embedding-only benchmarks.
Third, domain specificity. MTEB's datasets skew toward general knowledge, Wikipedia-style text, and web content. If you are building a RAG system for legal documents, medical records, or semiconductor datasheets, the benchmark scores may not predict your system's performance at all. Domain-specific vocabulary, document structure, and query patterns can dramatically shift the relative ranking of models.
[Figure: what MTEB benchmarks (gray column) versus what production RAG actually lives in (blue column).]
The practical implication: use MTEB as a starting shortlist, not a final answer. Pick the top five or six models from the leaderboard, then evaluate them on your actual data with your actual queries. The model that ranks first on MTEB may rank third on your domain. That is not a failure of benchmarking; it is a reminder that benchmarks measure what they measure.
The Dimension Question
Embedding dimensionality is one of those parameters that seems purely technical until you encounter its practical consequences. The number of dimensions in your embedding vectors affects three things simultaneously: retrieval quality, storage costs, and latency.
Quality vs. Dimensions
More dimensions give the model more room to encode fine-grained semantic distinctions. In a 384-dimensional space, the model must compress all the nuance of language into 384 numbers per text. In a 3072-dimensional space, it has eight times the capacity.
But the relationship between dimensions and quality is not linear. The first few hundred dimensions carry the bulk of the semantic information. Subsequent dimensions encode increasingly subtle distinctions. Going from 384 to 768 dimensions typically produces a measurable improvement in retrieval quality. Going from 1536 to 3072 produces a smaller improvement. Going from 3072 to 6144 would produce a negligible one for most tasks.
This is why Matryoshka representation learning (used in OpenAI's text-embedding-3 models) works. The training procedure encourages the model to front-load important information into the first dimensions. You can truncate a 3072-dimensional vector to 1024 dimensions and retain most of the retrieval quality, because the model learned to put the most discriminative features first.
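A minimal sketch of the truncation step, assuming you already hold full-length vectors from a Matryoshka-trained model. The helper name `truncate_embedding` and the random stand-in vector are illustrative and not part of any vendor SDK. The detail that is easy to miss is re-normalizing after truncation, so that dot products remain cosine similarities.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    if dims > vec.shape[-1]:
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vec[..., :dims]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.clip(norm, 1e-12, None)


# Stand-in for a real 3072-dimensional embedding (random, for illustration only).
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 1024)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

The OpenAI embeddings endpoint also accepts a `dimensions` parameter that performs the shortening server-side; the local version is useful when you want to store the full vector once and experiment with several sizes afterwards.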
Storage and Cost
Each dimension is typically stored as a 32-bit float, so the storage cost scales linearly with dimension count. At ten million vectors stored as float32 with no quantization, the choice of dimensionality compounds quickly.
Doubling dimensions doubles storage. At small scale this is irrelevant: ten thousand vectors occupy roughly 15 MB at 384 dimensions and roughly 123 MB at 3072, both trivial for a single machine. At ten million vectors, the difference between 384 and 3072 dimensions is the difference between an index that fits in RAM on a single machine and one that requires sharding across distributed infrastructure.
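The arithmetic behind that claim is short enough to write down. This sketch counts only raw float32 vector storage; the function name `index_size_gb` is illustrative, and real indexes add overhead (graph links, metadata, replicas) that is not modeled here.

```python
BYTES_PER_FLOAT32 = 4


def index_size_gb(num_vectors: int, dims: int, bytes_per_dim: int = BYTES_PER_FLOAT32) -> float:
    """Raw vector storage in gigabytes (10**9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_dim / 1e9


# Ten million vectors at the dimensionalities discussed in this section.
for dims in (384, 768, 1024, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(10_000_000, dims):7.2f} GB")
```

At ten million vectors this works out to about 15 GB at 384 dimensions and about 123 GB at 3072, which is the gap between a single commodity machine and a sharded deployment.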
Latency
Vector similarity search scales linearly with dimension count for brute-force comparisons. Approximate nearest neighbor (ANN) algorithms reduce this dependency, but higher dimensions still impose a cost. The distance calculation itself takes longer. The index structures consume more memory. Quantization (reducing precision from float32 to int8 or binary) can offset this, at the cost of some retrieval accuracy.
For most production RAG systems, 768 or 1024 dimensions represent the practical sweet spot. High enough to capture meaningful semantic distinctions. Low enough to keep storage and latency manageable. If you need more and can afford the infrastructure, 1536 dimensions offer diminishing but real improvements. Beyond that, you are in specialist territory.
When General-Purpose Models Are Enough
Here is the claim that will save you weeks of engineering time: for most RAG applications, a general-purpose embedding model is sufficient.
This is counterintuitive. If you are building a system for medical literature retrieval, surely you need a medical embedding model? If your documents contain legal contracts, should you not fine-tune on legal text?
Often, no. And the reason goes back to how these models are trained.
Modern embedding models like E5, BGE, and text-embedding-3 are trained on extraordinarily diverse corpora. They have seen medical papers, legal briefs, technical documentation, and financial reports during training. The vocabulary and patterns of these domains are already encoded in their vector spaces, even if no domain-specific fine-tuning was performed.
General-purpose models tend to be sufficient when three conditions hold:
- Your vocabulary overlaps substantially with standard English. If your documents use common words in their standard meanings, a general model already maps them to the right regions of the vector space. A medical article about "myocardial infarction" uses specialized vocabulary, but the surrounding text ("patients," "treatment," "risk factors") is thoroughly general.
- Your queries are natural language. Users asking "What are the side effects of metformin?" are writing text that closely resembles the training data. The query-document matching problem is well within the model's learned capability.
- Your quality bar is "good enough for LLM synthesis." RAG does not require perfect retrieval. It requires that the top-k retrieved passages contain enough relevant information for the language model to generate a useful answer. If three of your top five passages are relevant, the LLM can usually work with that.
I have seen teams spend months fine-tuning embedding models only to discover that the general-purpose baseline was within two percentage points of their custom model on end-to-end answer quality. The engineering time would have been better spent on chunking strategies, prompt engineering, or reranking.
Start with a general-purpose model. Measure end-to-end performance. Only invest in fine-tuning if the measurement shows a meaningful gap.
Fine-Tuning Embeddings
Sometimes the gap is real. When it is, fine-tuning the embedding model can produce dramatic improvements. Understanding when and how matters.
When Fine-Tuning Matters
Fine-tuning becomes necessary in specific, identifiable situations:
Domain-specific vocabulary with non-standard meanings. In semiconductor manufacturing, "wafer" does not mean a thin biscuit. In legal contracts, "consideration" has a precise technical meaning unrelated to thoughtfulness. When your domain repurposes common words, general models map them to the wrong region of vector space. Fine-tuning moves them to the right neighborhood.
Specialized abbreviations and jargon. If your corpus is full of terms like "CYP3A4," "10-K filing," or "ASIL-D compliance," the general model may not have seen enough examples during pre-training to produce meaningful embeddings. Fine-tuning teaches the model what these terms mean in your context.
Non-standard query patterns. If your users issue queries that look fundamentally different from web search queries (structured codes, part numbers, formulaic expressions), the model's learned query-document mapping may be miscalibrated.
When two-point accuracy improvements matter. In high-stakes applications (medical diagnosis support, legal discovery), even small improvements in retrieval precision can have significant downstream consequences. Fine-tuning is worth it when the cost of missed retrievals is high.
How Contrastive Learning Works
The dominant fine-tuning approach for embedding models is contrastive learning. The core idea is simple: teach the model which texts should be close together and which should be far apart.
You provide training examples as pairs or triplets:
# Positive pair: query + relevant passage
("What causes diabetic retinopathy?",
 "Chronic hyperglycemia damages retinal blood vessels...")
# Negative pair: query + irrelevant passage
("What causes diabetic retinopathy?",
 "The retina is a thin layer of tissue lining the back...")
The training objective pushes positive pairs closer together in vector space and negative pairs further apart. The loss function (typically InfoNCE or a variant) looks like this conceptually:
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(query, positive, negatives, temperature=0.05):
    """
    Push query toward positive, away from negatives.
    Temperature controls how sharply the model discriminates.
    """
    # Cosine similarities
    pos_sim = cosine_similarity(query, positive)
    neg_sims = [cosine_similarity(query, neg) for neg in negatives]
    # Softmax-style normalization
    numerator = np.exp(pos_sim / temperature)
    denominator = numerator + sum(np.exp(s / temperature) for s in neg_sims)
    return -np.log(numerator / denominator)
The temperature parameter controls discrimination sharpness. Lower temperatures make the model more decisive: it must push positives very close and negatives very far. Higher temperatures are more forgiving.
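A small worked example makes the temperature effect concrete. The similarity values below are invented for illustration, and `softmax_weights` is a hypothetical helper rather than a library function. It prints the InfoNCE loss together with the softmax weight each negative receives, which is proportional to how strongly that negative is pushed away during training.

```python
import math
from typing import List


def softmax_weights(pos_sim: float, neg_sims: List[float], temperature: float) -> List[float]:
    """Softmax over [positive, *negatives] after dividing similarities by temperature."""
    logits = [s / temperature for s in [pos_sim] + neg_sims]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


pos, negs = 0.80, [0.70, 0.30, 0.10]        # illustrative cosine similarities
for t in (1.0, 0.1, 0.05):
    weights = softmax_weights(pos, negs, t)
    loss = -math.log(weights[0])            # InfoNCE loss for this query
    print(f"T={t}: loss={loss:.3f}, negative weights={[round(w, 4) for w in weights[1:]]}")
```

At a temperature of 1.0 all three negatives receive comparable weight. At 0.05 nearly all of the weight assigned to negatives sits on the hardest one (similarity 0.70), so training effort concentrates on the near misses.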
The Importance of Hard Negatives
Not all negative examples are equally useful. A negative passage about cooking recipes is trivially distinguishable from a query about diabetic retinopathy. The model learns nothing from these easy cases.
Hard negatives are passages that are superficially similar to the positive but actually irrelevant. For our diabetic retinopathy query, a hard negative might be a passage about retinal anatomy that does not discuss diabetes, or a passage about diabetes management that does not mention eye complications.
Mining hard negatives is itself a skill. Common approaches include:
- BM25 negatives: Use lexical search to find passages that share keywords with the query but are not relevant. These are hard because they overlap in vocabulary.
- Embedding negatives: Use the current model to find passages that are close in vector space but not relevant. These are the hardest negatives, the ones the model currently gets wrong.
- Cross-encoder negatives: Use a more powerful cross-encoder to score candidates and select passages that the bi-encoder ranks highly but the cross-encoder ranks low.
The quality of your hard negatives often matters more than the quantity of your training data.
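The "embedding negatives" approach can be sketched in a few lines. This assumes you already have unit-normalized query and passage vectors plus a list of known-relevant passage ids; the function name `mine_hard_negatives` and the toy 2-D vectors are illustrative only.

```python
import numpy as np


def mine_hard_negatives(query_vec, passage_vecs, positive_ids, num_negatives=2):
    """Return ids of the highest-scoring passages that are NOT labeled relevant."""
    scores = passage_vecs @ query_vec              # cosine sim if inputs are unit-norm
    ranked = np.argsort(-scores)                   # best match first
    positives = set(positive_ids)
    hard = [int(i) for i in ranked if int(i) not in positives]
    return hard[:num_negatives]


# Toy 2-D unit vectors standing in for real embeddings.
passages = np.array([
    [1.00, 0.00],   # 0: relevant passage
    [0.96, 0.28],   # 1: near miss (hard negative)
    [0.60, 0.80],   # 2: loosely related
    [0.00, 1.00],   # 3: unrelated (easy negative)
])
query = np.array([1.0, 0.0])

print(mine_hard_negatives(query, passages, positive_ids=[0]))
```

One caveat with this approach: a top-ranked passage that is merely unlabeled may in fact be relevant. That is the reason for the cross-encoder filtering step listed above, which helps avoid training the model to push away correct answers.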
How Much Data Do You Need?
Less than you think, more than zero.
Sentence-BERT (Reimers & Gurevych, 2019) demonstrated that fine-tuning with as few as 1,000 labeled pairs could produce meaningful improvements on domain-specific tasks. Recent work suggests that 5,000 to 10,000 high-quality query-passage pairs is a reasonable target for most domains.
The key word is "high-quality." Ten thousand pairs where annotators carefully verified relevance will outperform one hundred thousand pairs generated by heuristic matching. And 5,000 pairs with well-mined hard negatives will outperform 10,000 pairs with random negatives.
Here is a practical recipe for generating training data:
"""
Generate training pairs for embedding fine-tuning.
Strategy:
1. Use an LLM to generate synthetic queries for your passages
2. Use BM25 to mine hard negatives
3. Filter with a cross-encoder for quality
"""
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.losses import MultipleNegativesRankingLoss

# `corpus` (a list of passage strings) and `llm_client` (any client exposing
# a complete(prompt) method) are placeholders you supply.

# Step 1: Generate synthetic queries from your passages
def generate_query(passage, llm_client):
    prompt = f"""Given the following passage, generate a natural
question that this passage would answer.
Passage: {passage}
Question:"""
    return llm_client.complete(prompt)

# Step 2: Create training examples
train_examples = []
for passage in corpus:
    query = generate_query(passage, llm_client)
    train_examples.append(
        InputExample(texts=[query, passage])
    )

# Step 3: Fine-tune with contrastive loss.
# MultipleNegativesRankingLoss treats the other passages in each batch as
# negatives; a mined hard negative can be added as a third text per example.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)
The synthetic query generation step is where LLMs have transformed the fine-tuning workflow. Instead of hiring annotators to write queries, you can use GPT-4 or Claude to generate plausible queries for each passage in your corpus. This is not a substitute for real user queries, but it is a remarkably effective bootstrap.
Domain Adaptation Without Fine-Tuning
Fine-tuning is effective but requires infrastructure, training data, and iteration cycles. Several techniques achieve partial domain adaptation without modifying model weights at all.
Instruction-Prefixed Models
Models like E5 and BGE support instruction prefixes that guide embedding behavior. Instead of passing raw text, you prepend a task description:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# For queries: prepend "query: "
query_embedding = model.encode(
    "query: What are the side effects of metformin?"
)

# For passages: prepend "passage: "
passage_embedding = model.encode(
    "passage: Common side effects of metformin include "
    "gastrointestinal symptoms such as nausea, diarrhea, "
    "and abdominal discomfort..."
)
This prefix mechanism works because the model was trained with these prefixes, so it has learned to adjust its behavior when it sees them at inference time. The "query:" prefix tells the model to produce an embedding optimized for matching against passages. The "passage:" prefix tells it to produce an embedding optimized for being found by queries. These are different optimization targets, and separating them improves retrieval.
BGE models use a slightly different convention:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE uses an instruction prefix for queries only
query_embedding = model.encode(
    "Represent this sentence for searching relevant passages: "
    "What are the side effects of metformin?"
)

# Passages are encoded without a prefix
passage_embedding = model.encode(
    "Common side effects of metformin include "
    "gastrointestinal symptoms..."
)
Query-Passage Asymmetry
The asymmetry between queries and passages is more than a prefix trick. It reflects a real structural difference in how queries and documents function in retrieval.
Queries are typically short, specific, and express an information need. "What causes memory leaks in Python?" is six words. The relevant passage might be 200 words explaining reference counting, circular references, and the gc module. The embedding model must map these two very different texts to nearby points in vector space.
This is hard. A model trained to put semantically similar texts close together will naturally embed the query near other short questions about Python, not near long explanatory passages about memory management. Query-passage asymmetry addresses this by allowing the model to produce different embedding distributions for queries and documents.
Cohere's input_type parameter makes this explicit:
import cohere

co = cohere.Client("your-api-key")

# Embed the query with search_query type
query_response = co.embed(
    texts=["What causes memory leaks in Python?"],
    model="embed-english-v3.0",
    input_type="search_query"
)

# Embed documents with search_document type
doc_response = co.embed(
    texts=[
        "Memory leaks in Python typically occur when objects "
        "maintain references that prevent garbage collection...",
        "The gc module provides an interface to the garbage "
        "collector, allowing you to inspect reference cycles..."
    ],
    model="embed-english-v3.0",
    input_type="search_document"
)
If you are using a model that supports these asymmetric modes and you are not using them, you are leaving retrieval quality on the table. This is one of the most common and most easily fixed mistakes in RAG implementations.
Prompt Engineering for Embeddings
A less discussed technique is adjusting the text you feed to the embedding model. This is not about changing the model; it is about changing the input.
Consider a chunk from a legal document:
Original chunk: "Section 4.2(b). Notwithstanding any provision herein to the contrary, the Indemnifying Party shall not be liable for any Losses to the extent arising from the Indemnified Party's gross negligence or willful misconduct."
A user might query: "Who is responsible if someone is grossly negligent?" The embedding model must bridge from casual language to legal prose. You can help by enriching the chunk before embedding:
Enriched chunk: "Indemnification limitation for negligence. This section addresses liability exceptions. Section 4.2(b). Notwithstanding any provision herein to the contrary, the Indemnifying Party shall not be liable for any Losses to the extent arising from the Indemnified Party's gross negligence or willful misconduct."
The prepended summary in plain language creates additional semantic hooks for the embedding model. This technique, sometimes called "contextual chunk enrichment," bridges the vocabulary gap without any model modification.
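The enrichment step itself is plain string assembly. The sketch below assumes the summary has already been written by a person or generated by an LLM; `enrich_chunk` is a hypothetical helper name, and the clause is abbreviated from the example above.

```python
from typing import Optional


def enrich_chunk(chunk: str, summary: str, title: Optional[str] = None) -> str:
    """Prepend a plain-language summary (and optional title) to a chunk."""
    header_parts = [part.strip() for part in (title, summary) if part and part.strip()]
    if not header_parts:
        return chunk
    return " ".join(header_parts) + "\n\n" + chunk


clause = (
    "Section 4.2(b). Notwithstanding any provision herein to the contrary, "
    "the Indemnifying Party shall not be liable for any Losses..."
)
text_to_embed = enrich_chunk(
    clause,
    summary="Indemnification limitation for negligence. "
            "This section addresses liability exceptions.",
)
print(text_to_embed)
```

A reasonable design choice is to embed the enriched text but store and return the original chunk, so the language model sees the authoritative wording rather than the added summary.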
Putting It All Together: A Decision Framework
Given the complexity of the landscape, here is a practical framework for embedding model selection in RAG systems. It is deliberately opinionated.
An opinionated path: a strong default, an honest baseline, and a hierarchy of cheap fixes before any fine-tuning investment. Most teams skip from Step 1 to Step 5 and waste weeks; Steps 2 through 4 are where the cheap wins live.
Step 1: Start with a Strong Default
Pick one of these and build your entire pipeline around it:
- If you want a commercial API: OpenAI text-embedding-3-small at 1536 dimensions. Good quality, reasonable cost, and the Matryoshka property lets you reduce dimensions later if needed.
- If you want open-source: BAAI/bge-large-en-v1.5 or intfloat/e5-large-v2 at 1024 dimensions. Proven, well-documented, and you control the infrastructure.
- If you need multilingual: BAAI/bge-m3 or Cohere embed-v3 with multilingual support.
Step 2: Build Your Evaluation Set
Before optimizing anything, create a test set of 50 to 100 query-passage pairs that represent your actual use case. These should include:
- Real queries from users (or realistic synthetic ones)
- The passages you expect the system to retrieve
- A few adversarial cases where similar-looking passages are not relevant
Measure retrieval quality with Recall@k (what fraction of relevant passages appear in the top k results) and NDCG@k (which also considers ranking position). If you do not have an evaluation set, you do not have a basis for any optimization decision.
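The difference between the two metrics is easiest to see side by side. This is a minimal sketch with binary relevance labels (a passage is either relevant or not); the function names and the toy rankings are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0


def ndcg_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """Binary-relevance NDCG: a hit at rank r contributes 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {3, 7}
good_order = [3, 7, 1, 4, 9]   # relevant passages ranked first
poor_order = [1, 4, 9, 3, 7]   # same passages, ranked last

for name, order in (("good", good_order), ("poor", poor_order)):
    print(name, recall_at_k(order, relevant, 5), round(ndcg_at_k(order, relevant, 5), 3))
```

Both rankings have perfect Recall@5, but NDCG@5 drops by roughly half for the second ordering. That gap matters when only the first few passages fit in the prompt.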
Step 3: Measure the Baseline
Run your evaluation set through the default model. Record the metrics. This is your baseline.
"""
Minimal RAG evaluation with retrieval metrics.
"""
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def evaluate_retrieval(queries, relevant_passages, corpus, k=5):
    """
    Compute Recall@k for a set of query-passage pairs.
    relevant_passages[i] holds the corpus indices relevant to queries[i].
    """
    # Encode corpus once; unit-normalize so the dot product is cosine similarity
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)
    recalls = []
    for query, expected_ids in zip(queries, relevant_passages):
        query_embedding = model.encode(query, normalize_embeddings=True)
        # Compute similarities
        similarities = np.dot(corpus_embeddings, query_embedding)
        top_k_ids = np.argsort(similarities)[-k:][::-1]
        # Check recall
        hits = len(set(top_k_ids.tolist()) & set(expected_ids))
        recall = hits / len(expected_ids)
        recalls.append(recall)
    return float(np.mean(recalls))

# Example usage (test_queries, test_relevant_ids, and document_chunks are your own data)
recall = evaluate_retrieval(
    queries=test_queries,
    relevant_passages=test_relevant_ids,
    corpus=document_chunks,
    k=5
)
print(f"Recall@5: {recall:.3f}")
Step 4: Try Low-Hanging Fruit First
Before fine-tuning, exhaust these cheaper optimizations:
- Use query-passage prefixes if your model supports them. This alone can improve recall by 5-15%.
- Adjust chunking strategy. Chunks that are too large dilute the embedding. Chunks that are too small lose context. Experiment with 256, 512, and 1024 token chunks with appropriate overlap.
- Add a reranker. A cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-12-v2) that rescores the top 20-50 results from your bi-encoder can dramatically improve precision. This is often more impactful than changing the embedding model.
- Enrich chunks with contextual metadata. Prepend document titles, section headers, or LLM-generated summaries to your chunks before embedding.
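For the chunking experiment, a sliding window with overlap is the usual starting point. The sketch below is one simple way to do it, not a recommendation of specific sizes. It splits on whitespace as a stand-in for real tokenization, which is an assumption: in practice you would count tokens with the embedding model's own tokenizer. The name `chunk_tokens` is illustrative.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace-style tokens are a stand-in; real systems should count model tokens.
words = [f"w{i}" for i in range(10)]
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

Running the same evaluation set against indexes built at 256, 512, and 1024 tokens then becomes a loop over `chunk_size`, with everything else held fixed.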
Step 5: Fine-Tune Only If Justified
If you have exhausted the above and your retrieval quality still falls short, fine-tuning is the next step. Follow the contrastive learning recipe described earlier. Start with synthetic queries generated by an LLM, mine hard negatives from your existing retrieval failures, and iterate.
Expect the process to take one to two weeks of focused engineering effort for a first iteration. Budget for three to five iterations before the model stabilizes.
Step 6: Monitor in Production
Embedding quality degrades over time as your corpus evolves and user query patterns shift. Build logging that captures retrieval results alongside user feedback signals (clicks, thumbs up/down, follow-up queries). Use this data to detect drift and to generate new fine-tuning examples for future iterations.
Drift in a production retrieval store does not announce itself with an error. The same query continues to return ten results, the API keeps responding, and only the relative ordering of those results shifts as the corpus and the model behind it evolve. The companion demo below simulates that scenario on a small synthetic corpus, exposing four detectors that surface different mechanisms of change: centroid distance and spatial KL respond when the population shifts, score-distribution PSI responds when the embedding function itself changes, and recall@K compared against a reference query suite responds to either. The auto-clicked controls walk through corpus growth, a model version push, and a Drift-Adapter mitigation in turn, so the detector signatures can be read against each cause.
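As a rough illustration of one of those detectors, here is a sketch of a population stability index (PSI) over top-result similarity scores. It is not the demo's implementation: the binning scheme, the floor value, and the synthetic score distributions are all assumptions made for the example.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two samples of similarity scores, binned on the reference range."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    edges = np.histogram_bin_edges(reference, bins=bins)
    # Clip so current scores outside the reference range land in the outer bins.
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the fractions to avoid log(0) in empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic top-1 similarity scores (illustrative only).
rng = np.random.default_rng(0)
baseline = rng.normal(0.60, 0.05, size=5000)      # scores logged at launch
same_model = rng.normal(0.60, 0.05, size=5000)    # later sample, nothing changed
new_model = rng.normal(0.50, 0.05, size=5000)     # after an embedding model swap

print(round(population_stability_index(baseline, same_model), 4))
print(round(population_stability_index(baseline, new_model), 4))
```

A common rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, though the right thresholds for a given retrieval system need to be calibrated on its own history.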
A Complete Working Example
Let us put the pieces together with a complete, runnable example that demonstrates embedding model usage in a minimal RAG pipeline:
"""
Minimal RAG pipeline demonstrating embedding model selection
and query-passage asymmetry.
Requirements:
pip install sentence-transformers numpy
"""
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Tuple
def build_index(
chunks: List[str],
model_name: str = "BAAI/bge-large-en-v1.5",
prefix: str = ""
) -> Tuple[SentenceTransformer, np.ndarray]:
"""
Encode document chunks into an embedding matrix.
Args:
chunks: List of text passages to index
model_name: HuggingFace model identifier
prefix: Optional prefix for passage encoding
Returns:
Tuple of (model, embedding_matrix)
"""
model = SentenceTransformer(model_name)
# Apply prefix if specified (e.g., for BGE models)
texts = [prefix + chunk for chunk in chunks]
embeddings = model.encode(
texts,
normalize_embeddings=True, # L2 normalize for cosine sim
show_progress_bar=True,
batch_size=32
)
return model, np.array(embeddings)
def retrieve(
query: str,
model: SentenceTransformer,
index: np.ndarray,
chunks: List[str],
k: int = 5,
query_prefix: str = ""
) -> List[Tuple[str, float]]:
"""
Retrieve top-k passages for a query.
Returns list of (passage, similarity_score) tuples.
"""
query_embedding = model.encode(
query_prefix + query,
normalize_embeddings=True
)
# Cosine similarity (embeddings are normalized)
similarities = np.dot(index, query_embedding)
top_k = np.argsort(similarities)[-k:][::-1]
return [
(chunks[i], float(similarities[i]))
for i in top_k
]
# ---- Example usage ----
# Sample corpus
chunks = [
"Python's garbage collector uses reference counting as its "
"primary mechanism. Each object maintains a count of "
"references pointing to it. When the count drops to zero, "
"the memory is immediately freed.",
"Circular references occur when two or more objects "
"reference each other, preventing their reference counts "
"from reaching zero. Python's gc module detects and "
"collects these cycles periodically.",
"Memory leaks in Python often stem from unintentional "
"references held in global variables, class attributes, "
"or closures that capture large objects.",
"The tracemalloc module, introduced in Python 3.4, "
"provides detailed memory allocation traces that help "
"developers identify the source of memory leaks.",
"Java uses a mark-and-sweep garbage collector that "
"periodically identifies and frees unreachable objects "
"from the heap.",
]
# Build index with BGE prefix convention
query_prefix = (
"Represent this sentence for searching relevant passages: "
)
model, index = build_index(chunks)
# Retrieve
results = retrieve(
query="What causes memory leaks in Python?",
model=model,
index=index,
chunks=chunks,
k=3,
query_prefix=query_prefix
)
for passage, score in results:
    print(f"[{score:.3f}] {passage[:80]}...")
This example is deliberately minimal. A production system would add persistent storage (a vector database like Pinecone, Weaviate, or pgvector), batch processing, caching, and error handling. But the core logic remains: encode passages, encode queries (with appropriate prefixes), compute similarity, return the top results.
Looking Forward
The embedding model landscape is evolving in several directions simultaneously.
Late interaction models like ColBERT store per-token embeddings rather than a single vector per passage. This preserves fine-grained matching information at the cost of larger index sizes. For domains where exact term matching matters (legal, medical), the tradeoff is often worthwhile.
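To make the late-interaction idea concrete, here is a minimal NumPy sketch of MaxSim-style scoring, the scoring rule ColBERT-like models use: each query token takes its best match among the passage tokens, and those maxima are summed. The function name `maxsim_score` and the toy two-dimensional token vectors are assumptions for illustration; a real model would produce the per-token embeddings.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, passage_tokens: np.ndarray) -> float:
    """Sum, over query tokens, of the best similarity to any passage token.

    Both inputs are (num_tokens, dim) arrays of L2-normalized token embeddings.
    """
    sim = query_tokens @ passage_tokens.T   # (query_tokens, passage_tokens)
    return float(sim.max(axis=1).sum())

# Toy token embeddings (hypothetical values, already unit length)
query = np.array([[1.0, 0.0], [0.0, 1.0]])
passage_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # matches both query tokens exactly
passage_b = np.array([[0.6, 0.8]])                           # one token, partial match to each

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

The index-size cost is visible in the shapes: passage_a stores three vectors where a single-vector model would store one.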
Multimodal embeddings are beginning to unify text, images, and structured data into shared vector spaces. Models like CLIP demonstrated this for text-image pairs; newer work extends the principle to tables, code, and domain-specific formats.
Longer context windows in embedding models (8192 tokens and beyond) reduce the need for aggressive chunking. If you can embed an entire document section as a single vector, you avoid the information loss that comes with splitting text at arbitrary boundaries.
Binary and quantized embeddings reduce storage requirements by more than an order of magnitude. Cohere's binary embeddings compress 1024 float32 dimensions (4,096 bytes) into 128 bytes, a 32x reduction. The retrieval quality loss is modest for most applications, especially when combined with a reranking stage.
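The 32x figure follows directly from keeping one bit per dimension. The sketch below uses sign-based binarization with `np.packbits` and compares vectors by Hamming distance; this is one common scheme and an assumption here, not a description of any vendor's exact method. The helper names and random data are illustrative.

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 dimensions into each byte."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distance(packed_query: np.ndarray, packed_index: np.ndarray) -> np.ndarray:
    """Number of differing bits between one packed query and every packed row."""
    xor = np.bitwise_xor(packed_index, packed_query)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 1024)).astype(np.float32)  # stand-in embeddings
packed = binarize(vectors)

bytes_float32 = vectors.shape[1] * 4        # 1024 dims * 4 bytes
bytes_binary = packed.shape[1]              # 1024 bits / 8
compression = bytes_float32 / bytes_binary

distances = hamming_distance(packed[0], packed)
print(bytes_float32, bytes_binary, compression)
```

In a two-stage setup, the Hamming scan would produce a candidate list that a full-precision model or reranker then reorders.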
But the fundamental insight has not changed since Firth. Meaning lives in relationships. Embedding models encode those relationships as geometry. The quality of your RAG system depends on the fidelity of that encoding.
Choose the model carefully. Evaluate it honestly. And do not fine-tune until you have exhausted the simpler options.
The Vector Database
Every dense-retrieval tutorial shows three lines of code to store and retrieve vectors. The data structures underneath those three lines (HNSW, IVF, product quantization) determine whether the system handles ten thousand documents or collapses at one million. This part is the layered descent through HNSW, when IVF wins, why product quantization makes billion-scale search affordable, and how to choose an index for your scale.
The Brute-Force Baseline
Before we discuss any clever indexing, it is worth understanding the naive approach. You have a collection of vectors, each representing a document chunk, an image, or a sentence. A user submits a query, which gets embedded into the same vector space. You need to find the k most similar vectors in your collection.
The simplest strategy is to compare the query vector against every single stored vector, compute a similarity score for each, sort the results, and return the top k. This is called a flat scan or brute-force search. It produces perfect recall because you literally check everything.
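As a rough sketch of that strategy, the function below scores every stored vector against the query and keeps the best k. It assumes rows and query are already L2-normalized so the dot product equals cosine similarity; `flat_search` and the random data are illustrative names and values, not part of any library.

```python
import numpy as np

def flat_search(query: np.ndarray, vectors: np.ndarray, k: int):
    """Exact top-k by cosine similarity over L2-normalized vectors."""
    scores = vectors @ query                     # one dot product per stored vector
    top = np.argpartition(-scores, k - 1)[:k]    # the k best, in arbitrary order
    top = top[np.argsort(-scores[top])]          # order those k by descending score
    return top, scores[top]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Query with a stored vector so the correct top hit is known in advance
ids, sims = flat_search(vectors[7], vectors, k=5)
print(ids, sims)
```

Every query touches all rows of `vectors`, which is exactly the cost the indexing algorithms in this part try to avoid.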
For a collection of 10,000 vectors with 1,536 dimensions (the output size of OpenAI's text-embedding-3-small), a flat scan works fine. On modern hardware, you can compute around 10,000 cosine similarities in a few milliseconds. But the cost is O(n) per query, where n is the number of vectors. That linear scaling is the problem.
At one million vectors, each query now requires one million distance computations. At 100 million, the math becomes punishing. If you are serving an application with hundreds of concurrent users, flat scan is not viable. You need some way to avoid comparing against most of the collection while still returning accurate results.
This is the core tradeoff in vector search: speed versus recall. Every algorithm we will look at sacrifices perfect recall in exchange for sub-linear query time. The question is always how much accuracy you are willing to lose and how fast you need the answer.
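Recall in this sense is straightforward to measure once a flat scan provides the ground truth. A minimal helper, with the name `recall_at_k` and the example ID lists chosen purely for illustration:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(found & exact) / k

# Hypothetical result lists: the approximate index missed two of the true top 5
demo_recall = recall_at_k([1, 2, 3, 9, 8], [1, 2, 3, 4, 5], k=5)
print(demo_recall)
```

Averaging this value over a few hundred representative queries gives the recall number that the latency tradeoff should be judged against.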
HNSW: The Dominant Algorithm
Hierarchical Navigable Small World graphs, introduced by Malkov and Yashunin in 2018, have become the default indexing strategy for most vector databases. If you are using Pinecone, Weaviate, Qdrant, or Chroma, you are almost certainly running HNSW or a close variant under the hood. Understanding why requires a short detour into graph theory.
The Small World Property
In the 1960s, Stanley Milgram ran his famous "six degrees of separation" experiments, which suggested that two strangers in the United States could typically be connected through a chain of roughly six acquaintances. The social network, despite containing hundreds of millions of nodes, had a remarkably short average path length. Mathematicians later formalized this as the "small world" property: a graph where most nodes are not directly connected, yet any node can be reached from any other through a small number of hops.
HNSW exploits this property for vector search. The idea is to build a graph where each vector is a node, and edges connect vectors that are relatively close in the embedding space. To find the nearest neighbor of a query, you do not scan the entire collection. Instead, you start at an entry point and greedily walk the graph, always moving to the neighbor closest to your query. If the graph has the small world property, you reach the neighborhood of the true nearest neighbor in O(log n) steps.
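The greedy walk can be sketched in a few lines. The toy graph below places ten points on a line, links each to its immediate neighbors, and adds one long-range edge between nodes 0 and 5; the graph, the query, and the function name `greedy_search` are all assumptions made for illustration, not HNSW's actual construction.

```python
import math

def greedy_search(graph, points, query, entry):
    """Walk the graph, always moving to the neighbor closest to the query.

    Returns (node reached, number of hops). Stops when no neighbor improves.
    """
    current = entry
    current_dist = math.dist(points[current], query)
    hops = 0
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(points[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:          # local minimum: no neighbor is closer
            return current, hops
        current, current_dist = best, best_dist
        hops += 1

# Ten points on a line, each linked to its immediate neighbors
points = {i: (float(i), 0.0) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
# One long-range "shortcut" edge, the ingredient that makes the world small
graph[0].append(5)
graph[5].append(0)

result = greedy_search(graph, points, query=(7.2, 0.0), entry=0)
print(result)
```

With the shortcut the walk goes 0, 5, 6, 7 in three hops; without it the same walk would need seven.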
The Layered Structure
A plain small world graph works, but greedy search on it can get trapped in local minima. HNSW addresses this with a hierarchy of layers, and the analogy to urban navigation is a close one.
Imagine you are trying to reach a specific coffee shop in an unfamiliar city. You would not start by walking down residential side streets at random. You would take the highway to the right part of the metro area, exit onto an arterial road to reach the right neighborhood, then navigate local streets to the specific block. Each transition narrows your search area dramatically.
HNSW works the same way. The top layer is sparse, containing only a few nodes with long-range connections (the highways). Each subsequent layer adds more nodes with shorter-range connections. The bottom layer, Layer 0, contains every vector in the collection with connections only to nearby neighbors (the side streets).
When a query arrives, the algorithm starts at the top layer's entry point and greedily navigates to the closest node it can find. It then drops down one layer, using that node as the starting point in the denser graph below. This continues layer by layer until it reaches the bottom, where it performs a final local search among the densest set of connections.
The assignment of nodes to layers is probabilistic. Each node always appears in Layer 0, but it gets promoted to higher layers with exponentially decreasing probability. This creates the right distribution: very few nodes in the top layer (providing coarse, long-range navigation) and all nodes in the bottom layer (providing fine-grained accuracy).
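The HNSW paper draws each node's top layer as floor(-ln(U) * mL) with U uniform on (0, 1], and suggests mL = 1/ln(M). Under that choice, the fraction of nodes reaching layer l is about M^(-l). The sketch below samples levels that way; the function name, seed, and sample size are illustrative assumptions.

```python
import math
import random
from collections import Counter

def sample_level(m: int, rng: random.Random) -> int:
    """Draw the highest layer a new node is inserted into."""
    m_l = 1.0 / math.log(m)          # level normalization factor suggested in the paper
    u = 1.0 - rng.random()           # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)   # floor, since the value is non-negative

rng = random.Random(42)
n_nodes = 100_000
levels = Counter(sample_level(16, rng) for _ in range(n_nodes))

for level in sorted(levels):
    print(f"top layer {level}: {levels[level]} nodes")
```

A node whose sampled level is L is present in layers 0 through L, so every node is still in Layer 0 while each higher layer holds roughly one sixteenth of the layer below it when M is 16.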
The query enters at the sparse top layer and drops into denser layers, narrowing the candidate set at each step until the final greedy search at Layer 0.
Key Parameters
Three parameters control HNSW behavior in practice:
M is the maximum number of connections per node. Higher values mean better recall but more memory and slower index construction. Typical values range from 16 to 64.
ef_construction controls how many candidates the algorithm considers when building the graph. Higher values produce a better graph but take longer to build. Values of 100 to 200 are common.
ef_search (sometimes called ef) controls how many candidates are considered at query time. This is the knob you turn to trade latency for recall. An ef_search of 50 is fast but might miss some results; 200 is slower but more thorough.
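HNSW can achieve recall above 0.95 at query times measured in single-digit milliseconds for collections of one million vectors. The tradeoff is memory: the full-precision vectors must stay in RAM, and the graph sits on top of them. For a million 1,536-dimensional float32 vectors, the raw data is about 6 GB. The graph links add roughly 8 to 10 bytes per vector for each unit of M (the rule of thumb given in the hnswlib documentation), which comes to a few hundred megabytes at M between 16 and 64, and a database may add its own overhead beyond that.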
That memory cost is why HNSW is not always the right choice.
IVF: Partitioning the Search Space
Inverted File Index takes a fundamentally different approach. Instead of building a graph, IVF partitions the entire vector space into regions and only searches the regions most likely to contain the answer.
Voronoi Cells
The construction phase runs k-means clustering on the full dataset, producing a set of centroid vectors. Each data vector is assigned to its nearest centroid, forming what mathematicians call Voronoi cells: regions of space where every point is closer to that cell's centroid than to any other centroid. If you have 1,024 centroids, your vector space is carved into 1,024 regions, each ideally holding a comparable share of the data, although k-means does not guarantee balanced cells.
At query time, the algorithm compares the query vector against all centroids (a much smaller set, typically 256 to 16,384), identifies the closest centroids, and then performs an exhaustive search only within those cells. If your query falls near the center of a cell, searching just that one cell is probably sufficient. If it falls near a boundary, you need to search adjacent cells too.
The nprobe Parameter
This is where nprobe comes in. It controls how many cells to search at query time. With nprobe=1, you only search the single closest cell, which is fast but risky; your true nearest neighbor might be just across the boundary in an adjacent cell. With nprobe=10, you search the ten closest cells, dramatically improving recall at the cost of searching ten times more vectors.
The relationship between nprobe and recall is not linear. Going from nprobe=1 to nprobe=5 might jump your recall from 0.70 to 0.92. Going from 5 to 20 might only push it from 0.92 to 0.98. The diminishing returns are steep, and tuning this parameter is one of the most important decisions in an IVF deployment.
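The whole IVF mechanism fits in a short NumPy sketch: a few k-means iterations to place the centroids, one inverted list of vector IDs per cell, and a search that scans only the nprobe closest cells. This is a simplified illustration under stated assumptions (random Gaussian data, a fixed iteration count, hypothetical function names), not how FAISS implements it.

```python
import numpy as np

def assign_to_centroids(x, centroids):
    """Index of the nearest centroid (squared L2) for every row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_ivf(data, n_cells, n_iters=10, seed=0):
    """Run a few k-means iterations, then store the vector IDs in each cell."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_cells, replace=False)].copy()
    for _ in range(n_iters):
        assign = assign_to_centroids(data, centroids)
        for c in range(n_cells):
            members = data[assign == c]
            if len(members):                      # leave empty cells where they are
                centroids[c] = members.mean(axis=0)
    assign = assign_to_centroids(data, centroids)
    cells = [np.flatnonzero(assign == c) for c in range(n_cells)]
    return centroids, cells

def ivf_search(query, data, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cell_d2 = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(cell_d2)[:nprobe]
    candidates = np.concatenate([cells[c] for c in probed])
    d2 = ((data[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d2)[:k]]

rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 16))
query = rng.standard_normal(16)
centroids, cells = build_ivf(data, n_cells=32)

exact = np.argsort(((data - query) ** 2).sum(axis=1))[:10]
for nprobe in (1, 4, 32):
    found = ivf_search(query, data, centroids, cells, k=10, nprobe=nprobe)
    recall = len(set(found.tolist()) & set(exact.tolist())) / 10
    print(f"nprobe={nprobe:2d}  recall@10={recall:.2f}")
```

Setting nprobe equal to the number of cells degenerates into a flat scan, which is a handy sanity check when tuning.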
IVF has a significant advantage over HNSW: the index is much smaller. The only additional data stored beyond the raw vectors is the set of centroids and the cell assignments. For memory-constrained environments, this matters.
Product Quantization: Compressing Vectors
Both HNSW and IVF assume you can fit all your vectors in memory. At scale, this assumption breaks. A billion 1,536-dimensional float32 vectors consume roughly 6 terabytes of RAM. Product Quantization (PQ), introduced by Jegou, Douze, and Schmid in 2011, addresses this by compressing vectors to a fraction of their original size.
How PQ Works
The core idea is surprisingly elegant. Take a 1,536-dimensional vector and split it into, say, 192 sub-vectors of 8 dimensions each. For each of these 192 sub-spaces, run k-means clustering independently to learn 256 representative centroids (called a codebook). Now, instead of storing the original 8-dimensional sub-vector, you store just the index of its nearest centroid, which is a single byte.
Your original vector of 1,536 float32 values (6,144 bytes) is now represented by 192 bytes. That is a 32x compression ratio. A billion vectors that required 6 TB now fit in roughly 192 GB.
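The arithmetic behind those figures, spelled out as a quick check. The variable names are illustrative; the codebook line is an added observation that the learned centroids themselves take negligible space.

```python
dims = 1536
bytes_per_float32 = 4
n_subvectors = 192            # 8 dimensions each
n_centroids = 256             # indices 0..255 fit in one byte
n_vectors = 1_000_000_000

raw_bytes = dims * bytes_per_float32                 # bytes per uncompressed vector
pq_bytes = n_subvectors * 1                          # one byte per sub-vector code
ratio = raw_bytes / pq_bytes

raw_total_tb = raw_bytes * n_vectors / 1e12          # whole collection, uncompressed
pq_total_gb = pq_bytes * n_vectors / 1e9             # whole collection, PQ codes only

# Codebooks: 192 sub-spaces x 256 centroids x 8 dims x 4 bytes, shared by all vectors
codebook_bytes = n_subvectors * n_centroids * (dims // n_subvectors) * bytes_per_float32

print(raw_bytes, pq_bytes, ratio, raw_total_tb, pq_total_gb, codebook_bytes)
```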
Distance computation also gets faster. Instead of computing the full distance between two 1,536-dimensional vectors, you precompute a lookup table of distances between the query's sub-vectors and all codebook centroids. The approximate distance between the query and any compressed vector is then a sum of 192 table lookups. This is substantially cheaper than 1,536 floating-point multiplications.
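Here is a compact NumPy sketch of the encode step and the lookup-table distance (often called asymmetric distance computation). It uses a small 32-dimensional example with 4 sub-vectors rather than 1,536 and 192, a short hand-rolled k-means, and hypothetical function names; it illustrates the mechanism, not a production implementation.

```python
import numpy as np

def train_codebooks(data, n_sub, n_centroids=256, n_iters=5, seed=0):
    """Learn one k-means codebook per sub-space."""
    rng = np.random.default_rng(seed)
    sub_dim = data.shape[1] // n_sub
    books = np.empty((n_sub, n_centroids, sub_dim), dtype=data.dtype)
    for s in range(n_sub):
        sub = data[:, s * sub_dim:(s + 1) * sub_dim]
        cent = sub[rng.choice(len(sub), n_centroids, replace=False)].copy()
        for _ in range(n_iters):
            assign = ((sub[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
            for c in range(n_centroids):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(0)
        books[s] = cent
    return books

def encode(data, books):
    """Replace each sub-vector by the one-byte index of its nearest centroid."""
    n_sub, _, sub_dim = books.shape
    codes = np.empty((len(data), n_sub), dtype=np.uint8)
    for s in range(n_sub):
        sub = data[:, s * sub_dim:(s + 1) * sub_dim]
        codes[:, s] = ((sub[:, None] - books[s][None]) ** 2).sum(-1).argmin(1)
    return codes

def adc_distances(query, codes, books):
    """Approximate squared L2 distances via one table lookup per sub-vector."""
    n_sub, _, sub_dim = books.shape
    # table[s, c] = squared distance from the query's sub-vector s to centroid c
    table = ((books - query.reshape(n_sub, 1, sub_dim)) ** 2).sum(-1)
    return table[np.arange(n_sub), codes].sum(axis=1)

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 32))
query = rng.standard_normal(32)

books = train_codebooks(data, n_sub=4)
codes = encode(data, books)               # 4 bytes per vector instead of 32 floats
approx = adc_distances(query, codes, books)
print(codes.shape, codes.dtype, approx[:3])
```

The lookup-table sum equals the exact distance from the query to each vector's reconstruction from its codes, so all of the error comes from the snapping step described next.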
The Accuracy Cost
Compression always loses information. PQ introduces quantization error because each sub-vector is snapped to its nearest codebook centroid. Two vectors that were slightly different in the original space might map to the same compressed representation, making them indistinguishable to the search algorithm.
In practice, PQ alone achieves recall rates of 0.80 to 0.90 on standard benchmarks. This sounds mediocre, but PQ is rarely used alone. The standard approach is IVF+PQ: use IVF to narrow the search to a few cells, then use PQ for fast approximate distance computation within those cells. The combination maintains reasonable recall (0.90 to 0.95 with proper tuning) while enabling datasets of a billion vectors on a single machine.
Facebook's FAISS library, described by Johnson, Douze, and Jegou in 2019, made this combination accessible through a well-optimized C++ implementation with Python bindings. FAISS remains the reference implementation that most vector databases either use directly or reimplement.
Choosing the Right Index
The decision of which algorithm to use is primarily a function of dataset size, memory budget, and accuracy requirements. Here is a practical guide.
| Scale | Recommended Index | Why |
|---|---|---|
| Under 100K vectors | Flat (brute-force) | Perfect recall, with latency still measured in milliseconds. No reason to introduce approximation at this scale. |
| 100K to 1M vectors | HNSW | Excellent recall (>0.95), single-digit millisecond queries. Memory overhead is manageable. |
| 1M to 10M vectors | HNSW or IVF | HNSW if you have the RAM. IVF if memory is constrained. Both deliver strong recall at this range. |
| 10M to 100M vectors | IVF+PQ | HNSW memory costs become prohibitive. IVF+PQ trades some recall for dramatically lower memory usage. |
| 100M+ vectors | IVF+PQ or specialized (ScaNN, DiskANN) | At this scale, you are likely using disk-based indices or distributed systems. FAISS IVF+PQ on GPU is one path; Google's ScaNN or Microsoft's DiskANN are others. |
Index choice changes shape as the dataset scales. Each algorithm has a sweet-spot range; the chart below shows the full bars where the algorithm is the natural choice and faded ends where it still works but loses to a neighbor.
A common mistake is reaching for HNSW at every scale. For a prototype with 5,000 document chunks, a flat index is simpler, faster to build, and returns perfect results. For a production system with 50 million product embeddings, HNSW's memory requirements might force you toward IVF+PQ even if you would prefer the higher recall.
Start simple. Complicate when the data demands it.
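The table can be encoded as a small helper for use in capacity-planning scripts. The function name, the `memory_constrained` flag, and the treatment of each boundary as exclusive at the top of the range are assumptions; the thresholds themselves come from the table above.

```python
def recommend_index(n_vectors: int, memory_constrained: bool = False) -> str:
    """Map collection size to the index family suggested in the table."""
    if n_vectors < 100_000:
        return "Flat"
    if n_vectors < 1_000_000:
        return "HNSW"
    if n_vectors < 10_000_000:
        return "IVF" if memory_constrained else "HNSW"
    if n_vectors < 100_000_000:
        return "IVF+PQ"
    return "IVF+PQ or specialized (ScaNN, DiskANN)"

for size in (5_000, 500_000, 5_000_000, 50_000_000, 500_000_000):
    print(f"{size:>11,} vectors -> {recommend_index(size)}")
```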
The Database Landscape
A vector index is not a vector database. FAISS, Annoy, and ScaNN are libraries that implement indexing algorithms. A vector database wraps those algorithms with persistence, CRUD operations, metadata storage, filtering, replication, and an API layer. The distinction matters because choosing a database is as much about operational concerns as algorithmic ones.
Pinecone
Pinecone is a fully managed service. You never provision servers, configure indices, or manage replication. You send vectors through an API and query them back. The indexing algorithm is opaque; Pinecone chooses and tunes it based on your data profile. For teams that want vector search without infrastructure overhead, Pinecone is the obvious choice. The tradeoffs are cost (managed services are never cheap at scale), vendor lock-in, and limited control over indexing parameters.
Chroma
Chroma occupies the opposite end of the spectrum. It runs in-process as a Python library, storing data locally. For development, prototyping, and small-scale deployments, it is hard to beat. You can have a working vector store in four lines of code. Chroma uses HNSW internally (via the hnswlib library) and provides a clean API for adding documents with metadata. The limitation is scale: Chroma is designed for single-machine workloads, and its persistence layer is straightforward rather than battle-tested.
Weaviate
Weaviate's distinguishing feature is hybrid search. It combines vector similarity with traditional keyword search (BM25) in a single query, letting you blend semantic and lexical matching. For RAG applications where you want to catch both semantically similar and keyword-matching documents, this is genuinely useful. Weaviate also supports custom modules for vectorization, meaning you can configure it to embed text automatically on ingest. It runs as a standalone service, self-hosted or through their managed cloud.
Qdrant
Qdrant is written in Rust and emphasizes performance and filtering capabilities. Its payload filtering system is particularly strong, supporting complex filter conditions that execute efficiently alongside vector search. If your use case involves heavy metadata filtering (say, searching product embeddings filtered by category, price range, and availability), Qdrant handles this well. It offers both a managed cloud and self-hosted deployment.
Milvus
Milvus is the heavyweight option, designed for billion-scale deployments. It supports multiple index types (HNSW, IVF, IVF+PQ, DiskANN), GPU acceleration, and distributed deployment across multiple nodes. The operational complexity is real, involving multiple microservices, an etcd dependency for coordination, and MinIO or S3 for object storage. If your problem truly requires billion-vector scale with high availability, Milvus is one of the few open-source options that handles it. For anything smaller, the operational burden is hard to justify.
pgvector
The pragmatist's choice. pgvector is a PostgreSQL extension that adds vector similarity search to the database you are probably already running. It supports exact (flat) search as well as IVFFlat and HNSW indices, integrates with standard SQL queries, and benefits from PostgreSQL's mature ecosystem of tooling, backups, and replication. The performance is not competitive with purpose-built vector databases at large scale, but for applications under a few million vectors where you want to avoid adding another database to your infrastructure, pgvector is compelling. Your vectors live alongside your relational data in a single, well-understood system.
Practical Comparison
| Database | Best For | Index Types | Deployment |
|---|---|---|---|
| Pinecone | Teams wanting zero infra management | Proprietary (auto-tuned) | Managed cloud only |
| Chroma | Prototyping, small-scale, local dev | HNSW | In-process / local server |
| Weaviate | Hybrid search (vector + keyword) | HNSW | Self-hosted / managed cloud |
| Qdrant | Complex metadata filtering | HNSW | Self-hosted / managed cloud |
| Milvus | Billion-scale, GPU acceleration | HNSW, IVF, IVF+PQ, DiskANN | Distributed / managed cloud |
| pgvector | Already using PostgreSQL, moderate scale | Flat, HNSW, IVFFlat | PostgreSQL extension |
There is no universally correct choice. But there is a useful heuristic: if you are building a prototype or course project, use Chroma. If you are building a production application and already run PostgreSQL, try pgvector first. If you need managed infrastructure at scale, evaluate Pinecone. Only reach for Milvus or a distributed setup when your data genuinely demands it.
Metadata Filtering and Why It Matters for RAG
In a real RAG system, you rarely want to search your entire vector collection. You want to search vectors that match certain conditions: documents from a specific user, chunks from a particular source, records created after a certain date. This is metadata filtering, and how a vector database implements it has significant performance implications.
Pre-filtering vs. Post-filtering
There are two strategies. Post-filtering runs the vector search first to retrieve the top candidates, then applies metadata filters to remove non-matching results. This is simple to implement but has an obvious flaw: if most of your top-K vector matches do not satisfy the filter, you end up with far fewer results than requested. You asked for 10 documents from "user_42" but the nearest 100 vectors all belong to other users.
Pre-filtering applies the metadata filter first, restricting the vector search to only those vectors that match the conditions. This guarantees you get the requested number of results (assuming enough matching vectors exist), but it complicates the index. The vector search algorithm now operates on a dynamic subset of the data, which can interact poorly with precomputed index structures like HNSW graphs or IVF cell assignments.
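The failure mode of post-filtering is easy to reproduce with a brute-force toy. In the sketch below, 190 vectors belonging to other users sit close to the query and 10 vectors belonging to "user_42" sit farther away; the data layout, the `fetch` parameter, and both function names are assumptions made for illustration.

```python
import numpy as np

def post_filter_search(query, vectors, owners, owner, k, fetch):
    """Vector search first (top `fetch`), then drop rows that fail the filter."""
    scores = vectors @ query
    top = np.argsort(-scores)[:fetch]
    return top[owners[top] == owner][:k]

def pre_filter_search(query, vectors, owners, owner, k):
    """Apply the filter first, then rank only the rows that passed it."""
    allowed = np.flatnonzero(owners == owner)
    scores = vectors[allowed] @ query
    return allowed[np.argsort(-scores)[:k]]

# Unit vectors on a circle: small angle means high cosine similarity to the query
angles = np.concatenate([
    0.001 * np.arange(1, 191),     # 190 vectors from other users, very close to the query
    0.5 + 0.01 * np.arange(10),    # 10 vectors from user_42, noticeably farther away
])
vectors = np.stack([np.cos(angles), np.sin(angles)], axis=1)
owners = np.array(["other"] * 190 + ["user_42"] * 10)
query = np.array([1.0, 0.0])

post = post_filter_search(query, vectors, owners, "user_42", k=10, fetch=100)
pre = pre_filter_search(query, vectors, owners, "user_42", k=10)
print(len(post), len(pre))
```

Post-filtering returns nothing here because none of user_42's vectors made the top 100, while pre-filtering returns all ten. Raising `fetch` helps only until the filter becomes more selective.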
Most modern vector databases use a hybrid approach. Qdrant and Weaviate implement what is sometimes called "filtered HNSW," where metadata conditions are checked during graph traversal. A node is only considered a candidate if it passes the filter. This avoids the pitfalls of pure post-filtering while keeping the efficient graph traversal structure.
For RAG applications, pre-filtering is almost always what you want. A common pattern is to store metadata like source, user_id, document_type, and created_at alongside each vector, then filter on these fields at query time. If your vector database handles this poorly, your RAG system will return irrelevant or insufficient results regardless of how good your embeddings are.
The index is only half the story. The filter is the other half.
Code: Working with Vector Databases
Let us ground all of this in actual code. The following examples demonstrate basic operations using Chroma (for local development) and show how the concepts we have discussed map to API calls.
Setting Up Chroma
# Install: pip install chromadb
import chromadb
from chromadb.utils import embedding_functions

# Create a persistent client (data survives restarts)
client = chromadb.PersistentClient(path="./chroma_data")

# Use the default embedding function (all-MiniLM-L6-v2)
# or bring your own from OpenAI, Cohere, etc.
embedding_fn = embedding_functions.DefaultEmbeddingFunction()

# Create or get a collection.
# HNSW settings are passed through collection metadata; the construction-time
# key is "hnsw:construction_ef" (check the docs for your Chroma version).
collection = client.get_or_create_collection(
    name="course_documents",
    embedding_function=embedding_fn,
    metadata={"hnsw:M": 32, "hnsw:construction_ef": 128}
)
Notice the metadata parameter on the collection. This is where you configure the HNSW index parameters we discussed earlier. An M of 32 means each node in the graph maintains up to 32 connections. The ef_construction of 128 controls the quality of the graph during index building.
Adding Documents with Metadata
# Add documents with metadata for filtering
collection.add(
    documents=[
        "Vector databases use approximate nearest neighbor algorithms.",
        "HNSW builds a navigable small-world graph for fast search.",
        "Product quantization compresses vectors to reduce memory.",
        "IVF partitions the vector space into Voronoi cells.",
        "Brute-force search compares every vector but does not scale.",
    ],
    metadatas=[
        {"topic": "overview", "week": 4},
        {"topic": "hnsw", "week": 4},
        {"topic": "quantization", "week": 4},
        {"topic": "ivf", "week": 4},
        {"topic": "baseline", "week": 4},
    ],
    ids=["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
)
Each document gets an embedding generated automatically by the embedding function. The metadata fields (topic, week) are stored alongside the vector and can be used for filtering at query time.
Querying with Metadata Filters
# Simple semantic search
results = collection.query(
    query_texts=["How do graph-based indices work?"],
    n_results=3
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f" [{dist:.4f}] {doc}")

# Example output (exact distances depend on the embedding model):
# [0.3821] HNSW builds a navigable small-world graph for fast search.
# [0.7234] Vector databases use approximate nearest neighbor algorithms.
# [0.8901] Brute-force search compares every vector but does not scale.
# Filtered search: only search documents about specific topics
filtered_results = collection.query(
    query_texts=["How can I reduce memory usage?"],
    n_results=2,
    where={"topic": {"$in": ["quantization", "ivf"]}}
)
for doc in filtered_results["documents"][0]:
    print(f" {doc}")

# Output:
# Product quantization compresses vectors to reduce memory.
# IVF partitions the vector space into Voronoi cells.
The where clause applies a pre-filter before vector search, ensuring results come only from documents matching the metadata condition. This is the filtered search pattern critical for multi-tenant RAG systems.
Using FAISS Directly
For more control over the indexing algorithm, you can use FAISS directly. This example builds and queries an IVF+PQ index.
import numpy as np
import faiss

# Generate sample data: 100,000 vectors of dimension 128
np.random.seed(42)
dimension = 128
n_vectors = 100_000
data = np.random.randn(n_vectors, dimension).astype("float32")

# Build a flat index (brute-force baseline)
flat_index = faiss.IndexFlatL2(dimension)
flat_index.add(data)

# Build an IVF+PQ index
# 256 Voronoi cells, vectors split into 16 sub-quantizers, 8 bits each
n_cells = 256
n_subquantizers = 16
n_bits = 8
quantizer = faiss.IndexFlatL2(dimension)
ivfpq_index = faiss.IndexIVFPQ(quantizer, dimension, n_cells, n_subquantizers, n_bits)

# IVF+PQ requires training on representative data
ivfpq_index.train(data)
ivfpq_index.add(data)

# Set nprobe: how many cells to search
ivfpq_index.nprobe = 10

# Query
query = np.random.randn(1, dimension).astype("float32")

# Compare results
flat_distances, flat_ids = flat_index.search(query, k=5)
ivfpq_distances, ivfpq_ids = ivfpq_index.search(query, k=5)
print("Flat (exact):", flat_ids[0])
print("IVF+PQ (approx):", ivfpq_ids[0])

# Measure recall: how many of IVF+PQ's results are in flat's results?
recall = len(set(flat_ids[0]) & set(ivfpq_ids[0])) / 5
print(f"Recall@5: {recall:.2f}")
Notice the train step. Unlike HNSW (which builds incrementally), IVF+PQ needs to learn its Voronoi centroids and PQ codebooks from a representative sample of data before you can add vectors. This is a one-time cost, but it means IVF+PQ is less suitable for collections that change rapidly.
Measuring the Speed Difference
import time

# Benchmark: 1000 queries against each index
queries = np.random.randn(1000, dimension).astype("float32")

start = time.perf_counter()
flat_index.search(queries, k=10)
flat_time = time.perf_counter() - start

start = time.perf_counter()
ivfpq_index.search(queries, k=10)
ivfpq_time = time.perf_counter() - start

print(f"Flat: {flat_time:.3f}s for 1000 queries")
print(f"IVF+PQ: {ivfpq_time:.3f}s for 1000 queries")
print(f"Speedup: {flat_time / ivfpq_time:.1f}x")

# Illustrative output at 100K vectors (timings depend on hardware):
# Flat: 0.842s for 1000 queries
# IVF+PQ: 0.031s for 1000 queries
# Speedup: 27.2x
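In this run, that is a 27x speedup at 100K vectors. At 10 million vectors the gap can grow to several hundred times, provided the number of cells is scaled up with the collection, because the flat scan grows linearly with collection size while IVF+PQ only scans the cells it probes. That is the difference between a usable application and one that times out.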
Putting It Together for RAG
In the context of a RAG pipeline, the vector database sits between the embedding model and the language model. Documents are chunked, embedded, and stored during ingestion. At query time, the user's question is embedded, the vector database returns the most relevant chunks, and those chunks are passed to the language model as context.
The choice of indexing algorithm affects three things that matter to end users: latency (how long until they see a response), relevance (whether the retrieved context actually answers their question), and cost (how much infrastructure you need to run).
For a course project or prototype with a few thousand documents, Chroma's default HNSW index gives you near-perfect recall with negligible latency. For a production system with millions of documents, you will likely need HNSW through a managed service like Pinecone or a self-hosted solution like Qdrant, with careful attention to metadata filtering for multi-tenant isolation.
The algorithms are the foundation. But the engineering decisions around chunking strategy, embedding model selection, metadata schema design, and filter architecture are what determine whether a RAG system works well in practice. The vector database is necessary infrastructure, not sufficient infrastructure.
Get the index right, then focus on everything around it.
Chunking and Context
The most consequential decision in a dense-retrieval pipeline is the one most teams spend the least time on: how to split documents into pieces. This part is a systematic comparison of fixed-size, recursive, semantic, and context-aware chunking, the effect of chunk size on retrieval precision and recall, and context positioning so the lost-in-the-middle effect does not bury high-priority chunks. One honest scoping note up front: chunking is primarily a dense-retrieval concern. Sparse retrieval over an inverted index (BM25 in Elasticsearch) does not have a chunking problem at scale because term-matching scores individual terms anywhere in a document, regardless of length.
When teams build retrieval-augmented generation systems, they tend to obsess over the choice of embedding model, the vector database, the reranking strategy. These are important decisions. But the upstream choice, the one that constrains everything downstream, is far more mundane. It is the question of how you break a document into chunks before embedding it.
Get chunking wrong and no amount of engineering elsewhere will save your retrieval quality. Get it right and even a simple pipeline can produce surprisingly good results. This part walks through the major chunking strategies, their tradeoffs, and the practical considerations that should guide your choices.
A Brief History of Text Segmentation
The problem of splitting text into meaningful units is far older than RAG. Information retrieval researchers have wrestled with segmentation since the 1970s, when Salton's SMART system had to decide what constituted a "document" for indexing purposes. In early search engines, the document was typically the unit of retrieval: a whole web page, a whole email, a whole file.
The shift toward sub-document retrieval came gradually. Passage retrieval, where systems return specific paragraphs rather than whole documents, gained traction in the TREC competitions of the late 1990s and early 2000s. Researchers discovered that returning a focused passage often produced better answers than returning an entire relevant document, because the user did not have to hunt for the specific information they needed.
The rise of dense retrieval models in the late 2010s made the segmentation question urgent again. Unlike sparse keyword methods (TF-IDF, BM25), which can score individual terms anywhere in a document, dense models compress the entire input into a single vector. The quality of that vector depends critically on what you feed the model: feed it too much and the representation blurs, feed it too little and it lacks context. The chunking problem, in its modern form, was born.
Why Chunking Matters
Embedding models transform text into dense vectors, typically of 768 or 1536 dimensions. These vectors represent the "meaning" of the input text as a point in high-dimensional space. When you pass a user's query through the same embedding model, you get another vector, and retrieval becomes a nearest-neighbor search.
The problem is that embedding models have context windows, usually between 512 and 8192 tokens depending on the model. You cannot embed an entire 50-page document in a single pass. Even models with larger context windows suffer from a more fundamental issue: as the input text grows longer, the resulting embedding becomes an average of too many ideas.
Consider a research paper that discusses methodology in section 3, results in section 4, and limitations in section 5. If you embed the entire paper as one vector, that vector represents a blurred composite of methodology, results, and limitations simultaneously. When a user asks "What were the limitations of this study?", the embedding of the full paper is a mediocre match because the limitations signal is diluted by everything else.
This is the average meaning problem. A single embedding cannot faithfully represent multiple distinct topics. The solution is to split documents into smaller pieces, each focused enough that its embedding captures a coherent idea. The question is how.
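A toy calculation makes the dilution concrete. The sketch below is illustrative only: it assumes three hand-made, mutually orthogonal "topic" vectors standing in for section embeddings (real embeddings are never this clean) and models the whole-paper vector as their mean.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical section embeddings: one axis per topic
methodology = [1.0, 0.0, 0.0]
results = [0.0, 1.0, 0.0]
limitations = [0.0, 0.0, 1.0]

# Assumption: embedding the whole paper behaves like averaging its sections
whole_paper = [sum(axis) / 3 for axis in zip(methodology, results, limitations)]

# A query that is purely about limitations
query = [0.0, 0.0, 1.0]

sim_chunk = cosine(query, limitations)
sim_paper = cosine(query, whole_paper)
print(f"section chunk: {sim_chunk:.3f}, whole paper: {sim_paper:.3f}")
```

In this idealized setup the focused chunk matches the query perfectly, while the averaged vector scores 1/sqrt(3), and the gap widens as more topics are folded into one embedding.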
Fixed-Size Chunking: The Baseline
The simplest approach is to split text into chunks of a fixed token count. This is where most teams start, and for good reason: it is easy to implement, easy to reason about, and works better than you might expect.
The choice of chunk size matters more than it might appear. Let's look at what different sizes actually contain, using a passage from a hypothetical textbook on machine learning:
256 tokens (~1 paragraph)
"Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. In machine learning, we use it to find the parameters that minimize our loss function. The algorithm computes the gradient of the loss with respect to each parameter, then updates the parameters by subtracting a fraction of the gradient. This fraction is called the learning rate."
At 256 tokens, you capture roughly one focused idea. The embedding will be specific, which is excellent for precision. If someone asks "What is gradient descent?", this chunk is a strong match. But if they ask "How does gradient descent relate to backpropagation?", this chunk alone cannot answer, because that connection was discussed two paragraphs later.
512 tokens (~2-3 paragraphs)
"Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. In machine learning, we use it to find the parameters that minimize our loss function. The algorithm computes the gradient of the loss with respect to each parameter, then updates the parameters by subtracting a fraction of the gradient. This fraction is called the learning rate. The choice of learning rate is critical. Too large, and the algorithm will overshoot the minimum, oscillating or diverging entirely. Too small, and convergence becomes impractically slow. Modern practice uses adaptive learning rate methods like Adam, which adjust the rate per parameter based on the history of gradients."
At 512 tokens, you capture a topic with some of its immediate context. The embedding represents a broader concept while remaining reasonably focused. This is the most common default for general-purpose RAG systems, and it represents a sensible middle ground.
1024 tokens (~5-6 paragraphs)
At 1024 tokens, you capture an entire subsection of a document. The embedding now represents a broader theme. This works well when queries are high-level ("Explain the optimization process in neural networks") but poorly when queries are specific ("What is the default learning rate in Adam?").
Here is a simple fixed-size chunker in Python:
import tiktoken


def fixed_size_chunks(
    text: str,
    chunk_size: int = 512,
    encoding_name: str = "cl100k_base",
) -> list[str]:
    """Split text into fixed-size token chunks."""
    encoder = tiktoken.get_encoding(encoding_name)
    tokens = encoder.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i : i + chunk_size]
        chunks.append(encoder.decode(chunk_tokens))
    return chunks


# Example usage
text = open("document.txt").read()
chunks = fixed_size_chunks(text, chunk_size=512)
print(f"Created {len(chunks)} chunks")
Fixed-size chunking has one glaring flaw: it is completely ignorant of document structure. A chunk boundary might land in the middle of a sentence, in the middle of a code block, or right between a heading and the paragraph it introduces. This is where overlap comes in.
Overlap: Bridging the Boundaries
When you split a document into non-overlapping chunks, information at the boundaries gets severed. A sentence that begins at the end of chunk 7 and finishes at the start of chunk 8 will be incomplete in both chunks. Neither embedding will faithfully represent what that sentence means.
The standard solution is to overlap chunks. Instead of each chunk starting where the previous one ended, you slide the window forward by less than the full chunk size. With a chunk size of 512 tokens and an overlap of 50 tokens, chunk 1 covers tokens 0-511, chunk 2 covers tokens 462-973, chunk 3 covers tokens 924-1435, and so on.
Typical overlap ratios fall between 10% and 20% of the chunk size. For 512-token chunks, that means 50 to 100 tokens of overlap. The tradeoffs are straightforward:
- More overlap means better boundary coverage but more total chunks, more storage, more embedding cost, and more redundancy in retrieval results.
- Less overlap means fewer chunks and lower cost but a higher chance of losing information at boundaries.
- No overlap is rarely the right choice unless your chunks already align with natural document boundaries (like section headers).
Here is the fixed-size chunker extended with overlap:
import tiktoken


def fixed_size_chunks_with_overlap(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50,
    encoding_name: str = "cl100k_base",
) -> list[str]:
    """Split text into fixed-size token chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        # A stride of zero or less would never advance the window
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    encoder = tiktoken.get_encoding(encoding_name)
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # this window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap  # slide window
    return chunks


# 512 tokens with ~10% overlap
text = open("document.txt").read()
chunks = fixed_size_chunks_with_overlap(text, chunk_size=512, overlap=50)
print(f"Created {len(chunks)} chunks with overlap")
A common mistake is treating overlap as a guaranteed fix for boundary problems. It helps, but it does not eliminate the issue. If a key concept spans 200 tokens and happens to straddle a boundary, a 50-token overlap will not capture it whole. Overlap is a probabilistic mitigation, not a solution.
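The boundary arithmetic can be checked without a tokenizer. The following is a minimal sketch that works on token positions only; `chunk_spans` and `fully_contained` are hypothetical helper names, and spans are half-open (start inclusive, end exclusive), so the window the text calls "tokens 0-511" appears as `(0, 512)`.

```python
def chunk_spans(
    n_tokens: int, chunk_size: int = 512, overlap: int = 50
) -> list[tuple[int, int]]:
    """Return half-open (start, end) token spans for overlapping windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the document is fully covered
        start += stride
    return spans


def fully_contained(span: tuple[int, int], spans: list[tuple[int, int]]) -> bool:
    """True if at least one chunk holds the whole span."""
    a, b = span
    return any(s <= a and b <= e for s, e in spans)


spans = chunk_spans(2000, chunk_size=512, overlap=50)
print(spans[:3])

# A 40-token idea near the first boundary survives intact in some chunk
print(fully_contained((470, 510), spans))
# A 200-token idea straddling the same boundary is split in every chunk
print(fully_contained((400, 600), spans))
```

The second check fails because the overlap region is only 50 tokens wide: any idea longer than the overlap can still be cut, depending on where it happens to fall.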
Recursive Chunking: Respecting Structure
Documents are not flat streams of tokens. They have structure: titles, sections, subsections, paragraphs, sentences. A chunking strategy that respects this structure will produce more coherent chunks than one that ignores it.
Recursive chunking, popularized by LangChain's RecursiveCharacterTextSplitter, works by attempting to split text using a hierarchy of separators. It first tries the largest structural boundary (double newlines, which typically mark paragraph or section breaks). If the resulting pieces are still too large, it splits those pieces on the next separator (single newlines). If pieces are still too large, it falls back to spaces (word boundaries) and finally to individual characters. Sentence boundaries are not in the default list, but you can add a separator such as ". " to the hierarchy.
The separator hierarchy for a typical document looks like this:
# LangChain's default separator hierarchy
separators = [
    "\n\n",  # Double newline (section/paragraph breaks)
    "\n",    # Single newline
    " ",     # Space (word boundaries)
    "",      # Character-level (last resort)
]
For Markdown documents, you can use a richer hierarchy:
# Markdown-aware separators
markdown_separators = [
    "\n# ",    # H1 headers
    "\n## ",   # H2 headers
    "\n### ",  # H3 headers
    "\n\n",    # Paragraph breaks
    "\n",      # Line breaks
    ". ",      # Sentence boundaries
    " ",       # Word boundaries
]
Here is a practical implementation using LangChain:
# Newer LangChain releases ship this class in the separate
# langchain_text_splitters package; adjust the import to your version.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# General-purpose recursive splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # characters, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

text = open("document.txt").read()
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks[:3]):
    print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk[:200])
    print()
The advantage of recursive chunking is that most chunks will align with natural boundaries in the text. A paragraph will not be split across two chunks unless it exceeds the maximum chunk size on its own. This produces embeddings that represent coherent thoughts rather than arbitrary slices.
LangChain's RecursiveCharacterTextSplitter has become the de facto standard for this reason. It is not the most sophisticated approach, but it is reliable, well-tested, and handles the common cases well. Most production RAG systems use some variant of this strategy.
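To see the mechanism without the library, here is a stripped-down sketch of the recursive idea in plain Python. It is not LangChain's implementation: `recursive_split` is a hypothetical function, it measures length in characters, and it omits overlap entirely. It only shows the two moves that matter: greedily merge pieces while they fit, and fall through to the next separator when a single piece is too large.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text on the coarsest separator that yields pieces <= chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard character slices
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the first.\n\nTiny."
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

Paragraphs stay whole whenever they fit, and only oversized pieces are broken on finer boundaries, which is the property the section above attributes to recursive chunking.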
Semantic Chunking: Following the Meaning
Fixed-size and recursive chunking both operate on surface-level features of text: token counts, newlines, punctuation. They do not understand what the text is about. Semantic chunking takes a different approach entirely: it uses the content's meaning to decide where to split.
The core idea, articulated clearly by Greg Kamradt in 2023, is to measure the semantic similarity between consecutive sentences. When adjacent sentences are about the same topic, their embeddings will be similar. When the topic shifts, the similarity drops. These drops are natural split points.
The algorithm works as follows:
- Split the document into sentences.
- Embed each sentence (or small groups of sentences for stability).
- Compute the cosine similarity between each consecutive pair of sentence embeddings.
- Identify points where the similarity drops below a threshold, or where the drop is significantly larger than average.
- Split the document at those points.
Here is a simplified implementation:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def semantic_chunk(
    text: str,
    threshold_percentile: int = 25,
    model_name: str = "all-MiniLM-L6-v2",
) -> list[str]:
    """Split text into semantic chunks based on topic shifts."""
    # Step 1: Split into sentences
    # (naive split on periods; it breaks on abbreviations and decimals,
    # so swap in a real sentence tokenizer for production text)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 3:
        return [text]

    # Step 2: Embed each sentence
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)

    # Step 3: Compute similarities between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
        similarities.append(sim)

    # Step 4: Find split points where similarity drops
    threshold = np.percentile(similarities, threshold_percentile)
    split_indices = [
        i + 1 for i, sim in enumerate(similarities) if sim < threshold
    ]

    # Step 5: Build chunks
    chunks = []
    start = 0
    for idx in split_indices:
        chunk = ". ".join(sentences[start:idx]) + "."
        chunks.append(chunk)
        start = idx

    # Don't forget the last chunk
    chunks.append(". ".join(sentences[start:]) + ".")
    return chunks
Semantic chunking produces variable-length chunks that correspond to topical segments of the document. A section discussing methodology might become one chunk, while the following section on experimental results becomes another, regardless of their respective lengths.
The tradeoffs are real. Semantic chunking requires embedding every sentence in the document before you even begin the actual chunking, which makes it significantly more expensive than recursive or fixed-size approaches. For a corpus of 10,000 documents, this preprocessing cost adds up quickly.
When is it worth the cost? Heterogeneous documents benefit most. A document that mixes financial analysis, legal clauses, and technical specifications has dramatic topic shifts that semantic chunking handles gracefully. A research paper with a conventional structure benefits less, because the structure itself (sections, headers) already provides reliable split points that recursive chunking can exploit.
Document-Type-Specific Strategies
The best chunking strategy is the one that understands your documents. Different document types have different natural boundaries, and exploiting those boundaries produces better chunks than any generic approach.
Source Code
Code has explicit structure that text lacks. Functions, classes, and methods are natural chunk boundaries. Splitting a function across two chunks is almost always wrong, because each half will be difficult to interpret without the other.
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# Python-aware code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200,
)

# This will split on class and function boundaries
code = open("my_module.py").read()
chunks = python_splitter.split_text(code)
LangChain provides language-aware splitters for Python, JavaScript, TypeScript, Go, Rust, Java, and several other languages. These use language-specific separators (class definitions, function definitions, decorators) instead of generic newlines.
Markdown
Markdown documents should be split on headers. Each section under a header forms a natural topical unit. The MarkdownHeaderTextSplitter in LangChain handles this directly:
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

markdown_text = open("document.md").read()
chunks = splitter.split_text(markdown_text)

# Each chunk includes header metadata
for chunk in chunks[:3]:
    print(chunk.metadata)  # {'h1': 'Introduction', 'h2': 'Background'}
    print(chunk.page_content[:100])
    print()
Notice that the splitter automatically attaches header metadata to each chunk. This is enormously useful for retrieval, as we will discuss in the metadata section below.
Legal Documents
Legal text is organized around clauses, sections, and subsections, each with numbering schemes like "Section 4.2(b)(iii)". Splitting on these boundaries preserves the logical units that lawyers and compliance systems need. A regex-based splitter that recognizes common legal numbering patterns will outperform generic chunking on contracts and regulations.
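As a rough illustration, a regex-based splitter might look like the sketch below. The pattern is an assumption chosen for this example (headings that start a line with "Section" or "Article" followed by a dotted number); real contracts vary widely, so treat `SECTION_START` and `split_legal` as hypothetical starting points, not a general solution.

```python
import re

# Assumed heading style: "Section 4.2 ..." or "Article 3 ..." at line start.
# Anchoring at line start avoids splitting on in-text cross-references
# such as "as defined in Section 1".
SECTION_START = re.compile(
    r"^(?:Section|Article)\s+\d+(?:\.\d+)*\b", re.MULTILINE
)


def split_legal(text: str) -> list[str]:
    """Split a document at numbered section headings, keeping each heading
    together with the text it introduces."""
    starts = [m.start() for m in SECTION_START.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []

    chunks = []
    preamble = text[: starts[0]].strip()
    if preamble:
        chunks.append(preamble)  # text before the first numbered section

    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


contract = (
    "PREAMBLE\n"
    "Section 1 Definitions\n"
    "Terms used here are defined in Section 1.\n"
    "Section 4.2 Payment\n"
    "(b) Fees are due within 30 days.\n"
)
for chunk in split_legal(contract):
    print(repr(chunk))
```

Sub-clauses such as "(b)" stay attached to their parent section here; splitting at that finer level would need an additional pattern and a size check.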
Conversations and Chat Logs
Conversations should be split on turn boundaries. Each turn (or small group of turns) forms a natural chunk. Splitting mid-turn destroys the question-answer pairing that gives conversational text its meaning.
def chunk_conversation(
    messages: list[dict],
    turns_per_chunk: int = 4,
) -> list[str]:
    """Chunk a conversation by grouping turns together."""
    chunks = []
    for i in range(0, len(messages), turns_per_chunk):
        group = messages[i : i + turns_per_chunk]
        chunk_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in group
        )
        chunks.append(chunk_text)
    return chunks
The general principle is clear: if your documents have structure, use it. Generic chunking is a fallback for when you do not know what your documents look like. Once you do know, specialize.
Tables and Structured Data
Tables are a special challenge. A table row makes little sense without its column headers, and a table split across two chunks will confuse both the embedding model and the language model downstream. The safest approach is to treat each table as an atomic unit. If a table is too large to fit in a single chunk, consider converting it to a series of natural-language statements ("Revenue in Q3 2024 was $4.2M") rather than splitting the table itself.
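One way to sketch the table-to-statements conversion is shown below. The function name, the sentence template, and the sample table are all made up for illustration; a real pipeline would need templates that suit each table's meaning (units, tense, what the rows represent).

```python
def table_to_statements(
    headers: list[str],
    rows: list[list[str]],
    subject_col: int = 0,
) -> list[str]:
    """Turn each non-subject cell into a standalone sentence that carries
    its own column header and row label."""
    statements = []
    for row in rows:
        subject = row[subject_col]
        for i, (header, value) in enumerate(zip(headers, row)):
            if i == subject_col:
                continue
            statements.append(f"{header} for {subject} was {value}.")
    return statements


# Hypothetical table
headers = ["Quarter", "Revenue", "Headcount"]
rows = [
    ["Q2 2024", "$3.9M", "112"],
    ["Q3 2024", "$4.2M", "120"],
]
for line in table_to_statements(headers, rows):
    print(line)
```

Each sentence is self-describing, so the statements can be grouped into chunks of any size without a row losing its header.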
PDFs with Mixed Content
Real-world PDFs often combine narrative text, tables, images with captions, headers, footers, and page numbers. A robust chunking pipeline for PDFs needs to handle extraction before chunking. Tools like PyMuPDF, Unstructured, or Amazon Textract can classify page elements by type, allowing you to route different content types to different chunking strategies. Narrative paragraphs get recursive chunking; tables get preserved whole; image captions get attached to their nearest text chunk as metadata.
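The routing step can be sketched independently of any extraction tool. The example below assumes extraction has already produced a list of elements shaped like `{"type": ..., "text": ...}`; that shape, the type labels, and `route_elements` are assumptions for illustration and do not match the output format of any particular library. Narrative text is split on blank lines as a stand-in for a real recursive chunker, and a caption is attached to the chunk emitted just before it as an approximation of "nearest".

```python
def route_elements(elements: list[dict]) -> list[dict]:
    """Route extracted page elements to a chunking rule by type."""
    chunks: list[dict] = []
    for element in elements:
        kind, text = element["type"], element["text"]
        if kind == "table":
            # Tables are kept whole as a single chunk
            chunks.append({"text": text, "metadata": {"kind": "table"}})
        elif kind == "caption":
            if chunks:
                # Attach to the most recently emitted chunk as metadata
                chunks[-1]["metadata"].setdefault("captions", []).append(text)
            else:
                chunks.append({"text": text, "metadata": {"kind": "caption"}})
        else:
            # Narrative: paragraph split as a placeholder for recursive chunking
            for para in text.split("\n\n"):
                if para.strip():
                    chunks.append(
                        {"text": para.strip(), "metadata": {"kind": "narrative"}}
                    )
    return chunks


elements = [
    {"type": "narrative", "text": "Intro paragraph.\n\nSecond paragraph."},
    {"type": "caption", "text": "Figure 1: Revenue by quarter"},
    {"type": "table", "text": "Quarter | Revenue\nQ3 2024 | $4.2M"},
]
for chunk in route_elements(elements):
    print(chunk)
```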
The Chunk Size and Retrieval Quality Curve
There is a fundamental tension in chunk size selection that no strategy fully resolves. Smaller chunks produce more precise embeddings, while larger chunks preserve more context. The relationship between chunk size and retrieval quality is not linear; it is a curve with a peak that depends on the nature of your queries.
Consider two extremes. At one end, you chunk at the sentence level. Each embedding is maximally precise: it represents exactly one idea. If a user's query matches that idea, the retrieval is excellent. But most queries require context that spans multiple sentences, and sentence-level chunks force the language model to reconstruct that context from scattered fragments. Liu et al. (2023) showed that language models use long contexts unevenly: accuracy drops when the key passage sits in the middle of the context window rather than near the start or end, which is a real risk when an answer is spread across many retrieved fragments.
At the other extreme, you chunk at the section or page level. Each embedding carries abundant context, but the signal is diluted. A page-level chunk about machine learning optimization will be a decent match for "What is gradient descent?" but also for "What is Adam?" and "What is learning rate scheduling?" and a dozen other queries. Precision suffers.
Chunk size and retrieval quality follow an inverted-U trade-off. Precision falls as chunks grow because the matching content gets diluted by surrounding noise; context coverage rises because larger chunks carry more of the answer in a single hit. Their product peaks in the 256 to 1024 token range for most use cases.
The sweet spot depends on query type:
- Factoid queries ("What year was the company founded?") benefit from smaller chunks (256-512 tokens) that isolate specific facts.
- Conceptual queries ("Explain the company's growth strategy") benefit from larger chunks (512-1024 tokens) that capture reasoning and relationships.
- Multi-hop queries ("How did the company's growth strategy change after the 2020 acquisition?") may benefit from a combination of chunk sizes, or from a reranking step that assembles context from multiple small chunks.
In practice, 512 tokens with 10-15% overlap is a reasonable starting point for most use cases. But you should always evaluate on your actual queries. Measure retrieval quality (recall@k, precision@k, or MRR) across different chunk sizes with a representative set of questions. The optimal size for your specific application may surprise you.
Metadata Enrichment: Context Beyond the Text
A chunk by itself is just a fragment of text. Without metadata, a chunk cannot be filtered by source, cited back to its original document, or contextualized within the surrounding section. Adding structured metadata to each chunk transforms it from an anonymous text snippet into a traceable piece of information.
Essential metadata fields include:
- Source document: filename, URL, or document ID. Without this, you cannot tell the user where an answer came from.
- Section title: the heading under which this chunk appeared. This enables filtering ("only search in the Methods section") and provides context to the language model.
- Page number: critical for PDFs and long documents where users need to verify information.
- Chunk index: the position of this chunk within the document, enabling retrieval of adjacent chunks for additional context.
- Document type: report, email, contract, transcript. Enables type-based filtering.
- Date: when the source document was created or last modified. Essential for time-sensitive domains.
Here is a complete chunking pipeline with metadata enrichment:
from dataclasses import dataclass, field
from langchain.text_splitter import RecursiveCharacterTextSplitter


@dataclass
class EnrichedChunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_with_metadata(
    text: str,
    source: str,
    doc_type: str = "unknown",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> list[EnrichedChunk]:
    """Chunk text and attach metadata to each chunk."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    raw_chunks = splitter.split_text(text)

    enriched = []
    for i, chunk_text in enumerate(raw_chunks):
        chunk = EnrichedChunk(
            text=chunk_text,
            metadata={
                "source": source,
                "doc_type": doc_type,
                "chunk_index": i,
                "total_chunks": len(raw_chunks),
                "char_count": len(chunk_text),
            },
        )
        enriched.append(chunk)
    return enriched


# Usage
chunks = chunk_with_metadata(
    text=open("annual_report.txt").read(),
    source="annual_report_2024.pdf",
    doc_type="financial_report",
)
print(chunks[0].metadata)
# {'source': 'annual_report_2024.pdf', 'doc_type': 'financial_report',
#  'chunk_index': 0, 'total_chunks': 47, 'char_count': 987}
Metadata also enables a powerful retrieval pattern: retrieve a chunk, then fetch its neighbors. If chunk 12 is relevant, chunks 11 and 13 probably provide useful context. This "context window expansion" at retrieval time partially compensates for using smaller chunk sizes.
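The neighbor lookup itself is a few lines once every chunk carries a `chunk_index`. This sketch works on indices only; `expand_with_neighbors` and its `window` parameter are hypothetical names, and fetching the actual chunk text from your store is left out.

```python
def expand_with_neighbors(
    hit_indices: list[int],
    total_chunks: int,
    window: int = 1,
) -> list[int]:
    """Return the retrieved chunk indices plus `window` neighbors on each
    side, deduplicated, clipped to the document, and in reading order."""
    keep: set[int] = set()
    for hit in hit_indices:
        for j in range(hit - window, hit + window + 1):
            if 0 <= j < total_chunks:
                keep.add(j)
    return sorted(keep)


print(expand_with_neighbors([12], total_chunks=47))     # [11, 12, 13]
print(expand_with_neighbors([0, 46], total_chunks=47))  # [0, 1, 45, 46]
print(expand_with_neighbors([5, 6], total_chunks=47))   # [4, 5, 6, 7]
```

Returning the indices in reading order lets the prompt present the expanded passage as continuous text instead of a ranked list of fragments.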
Some teams go further, prepending a summary of the section title or parent document to each chunk before embedding. This biases the embedding to capture not just the chunk's content but its role within the larger document. The cost is slightly larger chunks and more preprocessing, but the improvement in retrieval relevance can be substantial.
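A minimal version of the prepending step might look like this. The header layout and the function name are illustrative choices, not a standard; the only idea taken from the text is that the string sent to the embedding model carries document and section context ahead of the chunk body.

```python
def with_context_header(
    chunk_text: str,
    doc_title: str,
    section_title: str,
) -> str:
    """Build the string to embed: a short context header, then the chunk."""
    header = f"Document: {doc_title}\nSection: {section_title}"
    return f"{header}\n\n{chunk_text}"


text_to_embed = with_context_header(
    chunk_text="Revenue grew 12% year over year.",
    doc_title="Annual Report 2024",          # hypothetical values
    section_title="Financial Highlights",
)
print(text_to_embed)
```

A common arrangement is to embed the prefixed string but store and display the original chunk text, so the header influences retrieval without cluttering the context shown to the language model.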
Common Pitfalls
Having worked through the strategies, it is worth cataloging the mistakes that appear most often in practice.
Ignoring token vs. character distinctions. LangChain's RecursiveCharacterTextSplitter measures in characters by default, not tokens. A 1000-character chunk is roughly 200-250 tokens, depending on the text. If your embedding model has a 512-token limit and you set chunk_size=512 in a character-based splitter, your chunks will be far too small. Always be explicit about your unit of measurement.
Using the same strategy for all document types. A pipeline that chunks legal contracts and Python source code with the same recursive text splitter is leaving quality on the table. The cost of implementing document-type routing is small compared to the retrieval improvement it yields.
Neglecting to evaluate. Many teams choose a chunk size based on intuition or blog posts and never measure whether a different size would work better. Even a simple experiment, comparing retrieval recall at k=5 across three chunk sizes, can reveal significant differences.
Over-relying on overlap. Some teams set overlap to 50% of chunk size, creating massive redundancy without proportionate benefit. This doubles your storage and embedding costs while only marginally improving boundary coverage. If you need that much overlap, your chunks are probably too small.
Stripping formatting before chunking. Whitespace, headers, and list markers carry structural information that chunking strategies depend on. Aggressively normalizing text before chunking removes the signals that recursive and document-type-specific splitters need to find good boundaries.
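The overlap pitfall is easy to quantify before paying for any embeddings. The sketch below counts sliding-window chunks for a given document length; it assumes windowing stops as soon as a window reaches the end of the document, and `chunk_count` is a hypothetical helper.

```python
import math


def chunk_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # One chunk at position 0, then enough strides to reach the end
    return math.ceil((n_tokens - chunk_size) / stride) + 1


n = 100_000  # tokens in a hypothetical corpus
baseline = chunk_count(n, chunk_size=512, overlap=0)
for overlap in (0, 51, 256):
    count = chunk_count(n, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap:>3}: {count} chunks ({count / baseline:.2f}x)")
```

Because the stride is `chunk_size - overlap`, cost grows as roughly `chunk_size / stride`: about 10% overlap adds about 10% more chunks, while 50% overlap nearly doubles the count.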
Evaluating Your Chunking Strategy
The only reliable way to choose a chunking strategy is to measure its impact on your specific use case. Here is a lightweight evaluation approach that requires no specialized tooling:
from sentence_transformers import SentenceTransformer
import numpy as np


def evaluate_chunking(
    chunks: list[str],
    queries: list[str],
    relevant_chunks: list[list[int]],  # ground truth: indices of relevant chunks per query
    model_name: str = "all-MiniLM-L6-v2",
    k: int = 5,
) -> dict:
    """Evaluate a chunking strategy using recall@k and MRR."""
    model = SentenceTransformer(model_name)
    # Normalize so the dot product below is cosine similarity
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    query_embeddings = model.encode(queries, normalize_embeddings=True)

    # Compute similarities
    similarities = np.dot(query_embeddings, chunk_embeddings.T)

    recall_scores = []
    mrr_scores = []
    for i in range(len(queries)):
        # Get top-k chunk indices
        top_k = np.argsort(similarities[i])[::-1][:k]

        # Recall@k: fraction of relevant chunks in top-k
        relevant = set(relevant_chunks[i])
        retrieved = set(top_k.tolist())
        recall = len(relevant & retrieved) / max(len(relevant), 1)
        recall_scores.append(recall)

        # MRR: reciprocal rank of first relevant result
        for rank, idx in enumerate(top_k, 1):
            if idx in relevant:
                mrr_scores.append(1.0 / rank)
                break
        else:
            mrr_scores.append(0.0)

    return {
        "recall@k": np.mean(recall_scores),
        "mrr": np.mean(mrr_scores),
        "num_chunks": len(chunks),
    }


# Compare strategies.
# `strategies` maps a strategy name to its list of chunks, and `test_queries`
# is your annotated query set. Chunk indices differ between strategies, so
# `ground_truth` maps each strategy name to its own relevant-chunk indices.
for strategy_name, chunks in strategies.items():
    results = evaluate_chunking(chunks, test_queries, ground_truth[strategy_name])
    print(
        f"{strategy_name}: Recall@5={results['recall@k']:.3f}, "
        f"MRR={results['mrr']:.3f}, Chunks={results['num_chunks']}"
    )
The ground truth (which chunks are relevant to which queries) does require manual annotation, but even 30-50 annotated query-chunk pairs are enough to reveal meaningful differences between strategies. This is a few hours of work that can save weeks of debugging mysterious retrieval failures.
Putting It Together: A Decision Framework
With so many options, how do you choose? Here is a practical framework:
The path is straightforward: recursive baseline first, specialize by document type, attach metadata immediately, then measure before reaching for semantic chunking's compute cost.
Start with recursive chunking. Use LangChain's RecursiveCharacterTextSplitter with 512-1000 characters, 10-20% overlap, and separators appropriate for your document type. This is your baseline.
Specialize for known document types. If your corpus is entirely Markdown, use a Markdown-aware splitter. If it is code, use a language-aware splitter. If it is a mix, classify documents first and route to specialized splitters.
Add metadata from the start. Retrofitting metadata onto an existing chunk store is painful. Build it into the pipeline on day one. At minimum, track source document, chunk position, and section title.
Evaluate empirically. Create a test set of 50-100 representative queries with known relevant documents. Measure retrieval quality across different chunk sizes and strategies. The results will tell you more than any theoretical argument.
Consider semantic chunking for high-value corpora. If your documents are heterogeneous, if topic shifts are unpredictable, and if retrieval quality is critical enough to justify the compute cost, semantic chunking can provide meaningful improvements over structure-based approaches.
The Lewis et al. (2020) RAG paper demonstrated that retrieval quality is the primary bottleneck in retrieval-augmented generation. The language model can only work with what retrieval gives it. If your chunks are incoherent, your embeddings will be imprecise, your retrieval will be noisy, and your generated answers will suffer. Chunking is the foundation.
Most teams spend days evaluating embedding models and hours choosing chunk sizes. Invert that ratio. The embedding model matters, but it operates on what your chunking strategy gives it. Give it coherent, well-bounded, metadata-rich chunks and even a modest embedding model will produce good retrieval. Give it arbitrary slices of text and the best embedding model in the world will struggle.
Chunking is not glamorous work. It does not appear in paper titles or conference talks. But it is where RAG pipelines are won or lost.
How the Layers Cascade
The three layers above are presented as separate decisions, but in production they are coupled. A choice in any one of them constrains what is reasonable in the other two, and the team that picks each layer independently ends up with a stack whose pieces work against each other. This part names the four cascades that show up most often.
Embedding dimension cascades into vector-database cost
A 1024-dimensional embedding costs at least twice as much to store as a 512-dimensional one, because raw vector memory grows linearly with dimension from a base that is already large. Search gets more expensive too: each distance computation touches every component, and HNSW indexes over higher-dimensional vectors often need higher connectivity or wider search settings to hold the same recall. A team that chooses a 1024-dim model in Part 1 because it scored a fraction higher on MTEB is making a choice about Part 2 they have not registered. The right way to evaluate the embedding model is on the accuracy-versus-cost frontier, not on accuracy alone.
Product quantization is the lever that decouples this. PQ stores each vector as a sequence of small codebook indices instead of full-precision floats, with a quality cost that is usually small on dense retrievers and a memory cost that is an order of magnitude lower. Teams that have to use a high-dimensional embedding for accuracy reasons should plan on PQ from the start, not as a late optimization. The HNSW + PQ combination is the dominant production configuration for a reason.
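The storage side of this trade-off is simple arithmetic. The sketch below counts only the bytes for the vectors themselves, assuming float32 storage and PQ codes of 8 bits per subvector; graph links, codebooks, and metadata are deliberately left out, and the corpus size and subvector count are example values, not recommendations.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Bytes needed to store full-precision vectors (float32 by default)."""
    return n_vectors * dim * bytes_per_float


def pq_code_bytes(n_vectors: int, num_subvectors: int, bits_per_code: int = 8) -> int:
    """Bytes needed to store PQ codes: one small codebook index per subvector."""
    return n_vectors * num_subvectors * bits_per_code // 8


n = 10_000_000  # example corpus size
raw_512 = raw_vector_bytes(n, 512)
raw_1024 = raw_vector_bytes(n, 1024)
pq_1024 = pq_code_bytes(n, num_subvectors=128)

print(f"512-d float32 : {raw_512 / 1e9:.2f} GB")
print(f"1024-d float32: {raw_1024 / 1e9:.2f} GB")
print(f"1024-d PQ-128 : {pq_1024 / 1e9:.2f} GB ({raw_1024 // pq_1024}x smaller)")
```

With these example settings each 1024-dimensional vector shrinks from 4,096 bytes to 128 bytes, which is where the order-of-magnitude memory saving comes from; the recall cost of that compression still has to be measured on your own data.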
Chunk size cascades into embedding context and retrieval granularity
Every embedding model has a maximum input length, and a chunk that exceeds it is usually truncated silently by the tokenizer. A 1,000-token chunk fed to a 512-token model loses everything past the first 512 tokens, with no error and no warning in most pipelines. Teams that choose a large chunk size in Part 3 because "context is good" without checking the embedding model's input length end up with embeddings that represent only the head of each chunk.
The other direction is just as broken. A 64-token chunk that fits comfortably inside any modern embedding model produces a vector representing a very narrow window of text, which means the retrieval step returns many small fragments and the generator has to reassemble the document from them. The right chunk size is the one that fits the embedding model's input length and still captures enough context to be a useful unit of retrieval. For most pipelines those two constraints converge on chunks of roughly 300 to 600 tokens, inside the 256 to 1024 token range discussed in Part 3, but the exact figure should come from evaluation on your own queries.
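A cheap guard against silent truncation is to count tokens before embedding and flag anything over the limit. In the sketch below, `find_oversized` is a hypothetical helper and the default whitespace counter is only a stand-in so the example runs without dependencies; in a real pipeline, pass a `count_tokens` function built on the embedding model's own tokenizer, since whitespace counts usually underestimate true token counts.

```python
from typing import Callable


def find_oversized(
    chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[tuple[int, int]]:
    """Return (chunk_index, token_count) for every chunk over the limit."""
    oversized = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized


chunks = ["a short chunk", "word " * 600, "another short one"]
problems = find_oversized(chunks, max_tokens=512)
for index, n in problems:
    print(f"chunk {index}: {n} tokens exceeds the 512-token limit")
```

Running this check at indexing time turns a silent quality loss into a visible error that can fail the build or trigger re-chunking.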
The chunking strategy cascades into what reranking can recover
Reranking is the standard fix when retrieval precision is too low. A bi-encoder retrieves the top fifty candidates, then a slower cross-encoder rescores those fifty against the query, and the top three from the rescored list go into the LLM context. This works when the retrieval step surfaces the right document somewhere in its fifty-result candidate set; reranking cannot recover what was never retrieved.
If chunking destroys the contextual unit, no amount of reranking will recover it. Suppose a recursive chunker splits a four-paragraph argument across three chunks, none of which is individually retrievable for the query the user actually asked. Each piece is too narrow to score well on its own, so the bi-encoder ranks some other chunk higher for a tangential reason and the relevant pieces never reach the candidate set. The chunker's job is to produce units that are atomically retrievable. Reranking optimizes the order of what was found, not the set of what is findable.
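The two-stage shape, and its blind spot, can be shown with toy scorers. Nothing below is a real bi-encoder or cross-encoder: `fast_score` is a word-overlap count standing in for first-stage retrieval, and `slow_score` is a hand-written lookup standing in for a reranker that "knows" the right answer. The corpus and scores are invented for the demonstration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    fast_score: Scorer,
    slow_score: Scorer,
    n_candidates: int = 50,
    top_k: int = 3,
) -> list[str]:
    """Stage 1 picks candidates cheaply; stage 2 reorders only those."""
    candidates = sorted(
        corpus, key=lambda d: fast_score(query, d), reverse=True
    )[:n_candidates]
    return sorted(
        candidates, key=lambda d: slow_score(query, d), reverse=True
    )[:top_k]


def fast_score(query: str, doc: str) -> float:
    """Toy first stage: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


# Toy reranker: pretends to know that the third document is the best answer
ORACLE = {
    "adam optimizer learning rate": 0.6,
    "gradient descent learning basics": 0.1,
    "step size tuning guide": 0.9,
}


def slow_score(query: str, doc: str) -> float:
    return ORACLE[doc]


corpus = list(ORACLE)
narrow = retrieve_then_rerank(
    "learning rate", corpus, fast_score, slow_score, n_candidates=2, top_k=1
)
wide = retrieve_then_rerank(
    "learning rate", corpus, fast_score, slow_score, n_candidates=3, top_k=1
)
print(narrow)
print(wide)
```

With two candidates, the best document never reaches the reranker because the first stage scored it zero; widening the candidate set lets the reranker promote it. Widening helps only when the right chunk scores well enough to be found at some depth, which is the part chunking controls.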
The embedding model cascades into chunking strategy and reindexing cost
Different embedding models tokenize differently and trade off context length against quality differently, which means the optimal chunk size for one embedding model is not the optimal chunk size for another. A team that picks an embedding model and a chunk size together, then later swaps the embedding model for a newer one (which happens every twelve to eighteen months in production), discovers that the optimal chunk size has moved as well. The re-embedding cost is unavoidable when the model changes; the re-chunking cost can be invisible if nobody re-evaluates whether the original chunk-size choice still applies.
This is the cascade that hurts production teams most often. The first version of the system was tuned end to end with one embedding model and one chunk size; six months later the embedding model upgrade lands and quality drops, not because the new embedding model is worse, but because the chunk-size choice that was optimal for the old model is now suboptimal for the new one. The retrieval metrics tell the team the new embedding is bad. The real problem is that two layers of the stack are tuned against each other and only one of them moved.
The point of the merger
None of the four cascades above is novel in itself. Each one is mentioned somewhere in the body of one of the three original articles. What none of those articles could do, individually, was name the cascades together and let the reader hold them in one frame. The merger is what makes the cascade visible.
For a Week 5 student, the load-bearing lesson is the second-order one: the dense-vector retrieval stack is one design problem with three knobs, not three design problems with one knob each. Treat it as the former and the system that gets shipped will hold together as the embedding models, the database, and the chunker each evolve at their own pace. Treat it as the latter and the team will spend the back half of the year debugging cross-layer failures that nobody wrote the system to catch.
References
This article merges three previously separate published works. The references for each part are preserved at the original article URLs, which remain live as pre-merger snapshots:
- The Embedding Model Landscape: original Part 1 references (MTEB, BGE, E5, contrastive learning, fine-tuning, and the six-step decision framework).
- How Vector Databases Actually Work: original Part 2 references (HNSW, IVF, product quantization, vector DB benchmarks, and index selection).
- The Art of Chunking: original Part 3 references (chunking strategies, lost-in-the-middle, semantic chunking, and context positioning).
- What Classic Search Does Before the LLM: the lexical floor this article builds on, and where the binary heuristic for "do exact-match expectations exist?" is established.
- The Retrieval Quality Problem: companion piece on precision, recall, hybrid retrieval, and stratified evaluation. The natural next reading after this one.
- Retrieval Provenance: companion piece on source, confidence, timestamp, and agent metadata threading through any retrieval pipeline, vector or otherwise.