
RETIRED · FOLDED INTO MEASURING RETRIEVAL This article has been retired. Its unique content (the cross-encoder reranking pattern, query-transformation techniques including HyDE and step-back prompting, and the common failure-mode taxonomy) has been folded into a new section in Measuring Retrieval, which is now the canonical Week 5 reading for retrieval evaluation and the engineering levers that follow from it. The original remains here for anyone who arrived via an external link; do not cite this URL going forward.

The Retrieval Quality Problem

Most RAG failures are not generation failures. The LLM never had a chance because the right documents never made it into the context window.

When a retrieval-augmented generation system produces a wrong answer, the instinct is to blame the language model: perhaps the model hallucinated, the prompt was poorly structured, or the temperature was too high. These are reasonable hypotheses, and they are usually wrong.

The far more common failure mode is quieter and more fundamental: the retrieval step returned the wrong chunks. The model was asked to answer a question about quarterly revenue, but the retriever surfaced chunks about annual projections. The model was asked about a specific API endpoint, but the retriever returned documentation for a different version. In each case, the language model did exactly what it was told and synthesized the context it received; the context was simply wrong.

This is the retrieval quality problem. It sits at the center of every production RAG system, and solving it requires understanding a set of tradeoffs that information retrieval researchers have studied for decades. The tools have changed. The tradeoffs have not.

Precision and Recall: The Fundamental Tension

Information retrieval has always been defined by two competing metrics: precision and recall. Precision asks, "of the documents we returned, how many were actually relevant?" Recall asks, "of all the relevant documents that exist, how many did we find?"

In a RAG system, these map to concrete failure modes. Low precision means you are stuffing your context window with irrelevant chunks, diluting the signal with noise. The language model has to work harder to find the useful information, and it may latch onto irrelevant passages instead. Liu et al. (2023) demonstrated a related effect in their "Lost in the Middle" paper, showing that language models use relevant information less reliably when it sits in the middle of a long context than when it appears near the beginning or end.

Low recall means you missed something important. The answer existed in your knowledge base, but the retriever never found it. The language model, lacking the necessary context, either refuses to answer or confabulates one from whatever partial information it received.

Which matters more depends on your use case. A legal research system needs high recall because missing a relevant precedent could be malpractice. A customer support chatbot might prioritize precision because returning a single correct answer quickly is more valuable than surfacing every possibly relevant FAQ. A medical question-answering system arguably needs both, which is precisely what makes it so difficult to build well.

The challenge is that optimizing for one typically degrades the other. Cast a wider net and you improve recall at the cost of precision. Tighten your filters and precision improves while relevant documents slip through. Every retrieval strategy represents a position on this tradeoff curve.
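The tradeoff can be seen in a small sketch that measures both metrics at several cutoffs. The ranking and the relevance labels below are invented for illustration, and `precision_recall_at_k` is a hypothetical helper, not part of any library.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision, recall) over the top-k retrieved documents."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: a ranked result list and four labeled relevant documents.
# Document 11 is relevant but was never retrieved.
retrieved = [3, 7, 1, 9, 4, 8, 2, 6, 5, 10]
relevant = [3, 1, 2, 11]

for k in (2, 5, 10):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k:2d}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, widening the cutoff from 2 to 10 raises recall from 0.25 to 0.75 while precision falls from 0.50 to 0.30, which is the "wider net" effect described above.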

Dense Retrieval: Searching by Meaning

Dense retrieval, often called semantic search, represents the modern approach to the problem. The core idea is straightforward: encode both queries and documents as high-dimensional vectors (embeddings), then find documents whose vectors are close to the query vector in embedding space. Two pieces of text that mean similar things should end up near each other, regardless of whether they share any words.

This works remarkably well for many cases. A user who searches for "how to fix a slow database" will match documents about "optimizing query performance" and "improving database throughput," even though the vocabulary barely overlaps. The embedding model has learned that these concepts are semantically related. This ability to bridge the gap between different phrasings of the same idea is the fundamental advantage of dense retrieval.
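As a rough illustration of "close in embedding space," the sketch below ranks two documents by cosine similarity to a query. The three-dimensional vectors are hand-made stand-ins, not the output of a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
query_vec = [0.9, 0.1, 0.0]  # "how to fix a slow database"
doc_vecs = {
    "optimizing query performance": [0.8, 0.3, 0.1],
    "office holiday schedule": [0.0, 0.2, 0.9],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    doc_vecs,
    key=lambda title: cosine_similarity(query_vec, doc_vecs[title]),
    reverse=True,
)
print(ranked)
```

The query shares no words with either title; the ranking depends only on how the vectors point.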

But dense retrieval has a significant blind spot of its own. It is the reverse of the classic vocabulary mismatch problem and is closer to vocabulary erasure: embedding models compress text into fixed-dimensional vectors, and that compression is lossy. Specific terms, especially rare ones, often get smoothed away in favor of general semantic meaning.

Consider a user searching for "CEO compensation at Acme Corp." The relevant document might say "the chief executive officer received total remuneration of $4.2 million." A good embedding model might bridge "CEO" to "chief executive officer," but it may struggle with "Acme Corp" if that entity was underrepresented in the model's training data. Proper nouns, product codes, serial numbers, medical abbreviations, legal citation formats: these are all cases where the specific string matters as much as its meaning.

The problem compounds with short, precise queries. When a user types "ERR-4092" looking for an error code, dense retrieval may return documents about errors in general rather than the specific error code. The embedding captured the concept of "error" perfectly well. It just lost the part that mattered most.

Choosing an Embedding Model

The choice of embedding model shapes everything downstream. Smaller models like all-MiniLM-L6-v2 (384 dimensions) are fast and cheap but sacrifice nuance. Larger models like OpenAI's text-embedding-3-large (3072 dimensions) or Cohere's embed-v3 capture finer distinctions but cost more per query and require more storage for the index.

Dimension count matters more than it might seem. Each document in your index occupies memory proportional to its embedding dimension. A corpus of one million chunks at 384 dimensions requires roughly 1.5 GB of float32 vectors. The same corpus at 3072 dimensions requires 12 GB. At scale, this difference drives architectural decisions about whether embeddings live in memory, on disk, or in a managed vector database.
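The arithmetic behind those figures is simple enough to sketch. The helper below (a hypothetical name) counts raw vector storage only, assuming 4 bytes per float32 value and 1 GB = 10^9 bytes; a real index adds overhead on top of this.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes), excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9


for dim in (384, 3072):
    size = index_size_gb(1_000_000, dim)
    print(f"{dim:4d} dims -> {size:.3f} GB")
```

Passing `bytes_per_value=2` gives the corresponding estimate for float16 storage, which halves both numbers.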

Domain also matters. General-purpose embedding models trained on web text may underperform on specialized corpora, such as biomedical literature, legal filings, or financial reports, because the distributional patterns in those domains differ from the training data. Fine-tuning an embedding model on domain-specific data, even with a relatively small training set, often yields meaningful improvements in retrieval quality. This is one of the highest-leverage interventions available when building a domain-specific RAG system.

Sparse Retrieval: The Persistence of Keywords

Before embeddings, there was BM25. Before BM25, there was TF-IDF. The lineage stretches back to the 1970s, and the core intuition has remained stable: documents are relevant to a query if they share important terms, where "important" roughly means "frequent in this document but rare across the collection."

BM25, developed in the 1990s within the Okapi project at City University London and surveyed in depth by Robertson and Zaragoza (2009), remains the standard sparse retrieval algorithm. It scores documents based on term frequency, inverse document frequency, and document length normalization. It requires no training, no GPU, and no embedding model. It is fast, interpretable, and surprisingly effective.

Sparse retrieval excels precisely where dense retrieval struggles. Search for "ERR-4092" and BM25 will find every document containing that exact string. Search for "Acme Corp Q3 2024 revenue" and BM25 will prioritize documents containing those specific tokens. There is no compression, no lossy encoding. If the term appears in the document, the algorithm knows.

The weakness is the mirror image of dense retrieval's strength. BM25 has no understanding of meaning. Search for "automobile insurance claims" and it will miss documents about "car insurance disputes" unless they happen to share enough terms. Synonyms, paraphrases, and conceptual similarity are invisible to it. Each query retrieves only from the slice of the corpus that shares its exact vocabulary.

This is not a flaw so much as a design constraint. Sparse retrieval trades semantic understanding for lexical precision. For many real-world queries, that trade is worthwhile.

How BM25 Actually Works

BM25's scoring function, while mathematically compact, encodes several important intuitions. For a query Q containing terms q1, q2, ..., qn, the score for a document D is the sum of each query term's contribution, where each term's contribution depends on three factors.

First, how often does the term appear in this document? More occurrences suggest higher relevance, but with diminishing returns. The tenth occurrence of "retrieval" contributes less than the first. BM25 controls this saturation with a parameter k1, typically set to 1.2.

Second, how long is this document relative to the average? Longer documents naturally contain more term occurrences, so BM25 normalizes for length. The parameter b, typically 0.75, controls how aggressively this normalization is applied. Setting b=0 disables length normalization entirely; setting b=1 applies full normalization.

Third, how rare is this term across the collection? A term that appears in nearly every document (like "the" or "is") carries little discriminative signal. A term that appears in only a handful of documents (like "XR-7500" or "mitochondrial") is highly informative. This inverse document frequency (IDF) weighting ensures that distinctive terms drive the ranking.

These three factors, term frequency saturation, document length normalization, and inverse document frequency, combine to produce a scoring function that has proven remarkably hard to beat for keyword-based retrieval, despite decades of attempts.
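The three factors can be put together in a short pure-Python sketch. It is a minimal version under stated assumptions: whitespace tokenization, and the non-negative IDF variant `log((N - n + 0.5) / (n + 0.5) + 1)`, which is one of several in common use. The function name and the sample documents are illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Factor 2: length normalization (b=0 disables it, b=1 applies it fully)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Factor 3: rare terms get a larger IDF weight
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            # Factor 1: term frequency with saturation controlled by k1
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical three-document corpus, pre-tokenized by whitespace.
docs = [
    "the xr-7500 sensor operates from -40 to 85 degrees".split(),
    "the sensor family overview covers the xr series".split(),
    "the cafeteria menu changes weekly".split(),
]
scores = bm25_scores(["xr-7500", "sensor"], docs)
print(scores)
```

The first document matches both query terms, including the rare "xr-7500", so it scores highest; the third shares no query terms and scores zero.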

Hybrid Search: Combining What Works

If dense retrieval captures meaning and sparse retrieval captures terms, the obvious question is: why not use both? This is the premise of hybrid search, and in practice it outperforms either approach alone for the majority of RAG workloads.

The mechanics are simple. Run the query through both a dense retriever and a sparse retriever. Each produces a ranked list of documents. Merge the two lists into a single ranking.

The merging step is where things get interesting. The most widely used approach is Reciprocal Rank Fusion (RRF), introduced by Cormack, Clarke, and Büttcher (2009). RRF does not require the scores from different retrievers to be on the same scale. Instead, it uses only the rank positions. For each document, it computes:

# Reciprocal Rank Fusion
# For each document d, sum across all ranking systems:
# RRF_score(d) = sum( 1 / (k + rank_i(d)) )
# where k is a constant (typically 60) and rank_i is the
# rank of document d in the i-th retrieval system

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists using RRF."""
    fused_scores = {}

    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            fused_scores[doc_id] += 1.0 / (k + rank)

    # Sort by fused score, descending
    return sorted(
        fused_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )

The constant k=60 controls how much weight is given to rank differences. Higher values of k dampen the impact of rank position, making the fusion more uniform. The original paper found k=60 to be robust across datasets, and most practitioners stick with it.
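To make the dampening concrete, the sketch below compares how much more a rank-1 hit contributes than a rank-10 hit for a few values of k. The values 1 and 1000 are arbitrary illustrative choices, and `rrf_contribution` is a hypothetical helper that computes a single term of the RRF sum.

```python
def rrf_contribution(rank, k=60):
    """Score that one ranked list contributes for a document at `rank`."""
    return 1.0 / (k + rank)


for k in (1, 60, 1000):
    ratio = rrf_contribution(1, k) / rrf_contribution(10, k)
    print(f"k={k:4d}  rank-1 / rank-10 contribution ratio: {ratio:.2f}")
```

The ratio works out to roughly 5.5 at k=1, 1.15 at k=60, and 1.01 at k=1000: as k grows, the top of each list matters less relative to the rest.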

An alternative to RRF is weighted linear combination, where you normalize the scores from each retriever to a common scale and take a weighted sum. This gives you a tunable parameter (the weight ratio between dense and sparse) but requires that scores be meaningfully comparable across systems, which is often harder than it sounds.
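A minimal sketch of the weighted alternative follows. It assumes min-max normalization, which is only one of several ways to bring scores onto a common scale, and a single weight `alpha` for the dense side. Function names, scores, and document ids are all illustrative.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: no ranking signal to preserve
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """Fuse two score mappings; alpha weights dense, (1 - alpha) weights sparse."""
    dense_n = min_max_normalize(dense_scores)
    sparse_n = min_max_normalize(sparse_scores)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores: cosine similarities and BM25 scores live on
# very different scales, which is why normalization is needed first.
dense = {"a": 0.82, "b": 0.80, "c": 0.40}
sparse = {"b": 12.0, "c": 9.0, "d": 3.0}
fused = weighted_fusion(dense, sparse, alpha=0.5)
print(fused)
```

Note that min-max normalization is sensitive to outliers: a single extreme score compresses every other document toward zero, which is part of why score comparability is harder than it sounds.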

Where Hybrid Wins: A Concrete Example

Consider a knowledge base for an electronics manufacturer. A user asks: "What is the operating temperature range for the XR-7500 sensor module?"

The dense retriever understands this is a question about hardware specifications and temperature tolerances. It returns chunks about sensor specifications, thermal management guidelines, and environmental testing procedures. Relevant in theme, but it may rank a general overview of the XR series above the specific datasheet for the XR-7500 because the embeddings for those documents are semantically close.

The sparse retriever finds every document containing "XR-7500." It does not understand the question is about temperature, but it nails the product identifier. The specific datasheet appears near the top.

Hybrid search, via RRF, promotes the XR-7500 datasheet to the top of the fused list because it ranks well in both systems. The general sensor overview ranks well in the dense list but poorly in the sparse list (it never mentions "XR-7500" by name), so it drops in the fused ranking. The right document surfaces.
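The scenario can be replayed on toy data. The two ranked lists below are invented to mirror the description, and `rrf` is a compact restatement of the fusion function shown earlier, repeated here only so the sketch runs on its own.

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists of document ids; return ids ordered by fused score."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the XR-7500 temperature question.
dense = ["xr_series_overview", "xr7500_datasheet", "thermal_guidelines"]
sparse = ["xr7500_datasheet", "xr7500_release_notes", "xr7500_warranty"]

fused = rrf([dense, sparse])
print(fused)
```

The datasheet is the only document present in both lists, so it collects two contributions and moves ahead of the overview that the dense retriever had ranked first.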

Now consider the opposite scenario. A user asks: "How do I handle intermittent connectivity issues with IoT sensors in high-humidity environments?" No product codes, no exact terms to match. The sparse retriever flounders, matching on common words like "sensor" and "connectivity" without understanding the conceptual query. The dense retriever recognizes this as a question about reliability engineering in challenging environments and surfaces the right technical notes. In the fused list, the dense retriever's top results still surface near the top. RRF sees only rank positions, not confidence, so the sparse retriever's scattered matches are interleaved with them rather than suppressed, and documents that rank well in both lists rise above both.

This complementarity is why hybrid search is the default recommendation for production RAG systems. It is not always the best approach for every query, but it is the most robust across the full distribution of queries your system will encounter.

Implementing Hybrid Search

from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    """Combines dense (embedding) and sparse (BM25) retrieval."""

    def __init__(self, documents, model_name="all-MiniLM-L6-v2"):
        self.documents = documents
        self.model = SentenceTransformer(model_name)

        # Build dense index; unit-normalize so the dot product below
        # is cosine similarity regardless of the model chosen
        self.embeddings = self.model.encode(
            documents, normalize_embeddings=True
        )

        # Build sparse index
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def dense_search(self, query, top_k=50):
        """Retrieve by semantic similarity."""
        query_emb = self.model.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return top_indices.tolist()

    def sparse_search(self, query, top_k=50):
        """Retrieve by BM25 keyword matching."""
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = np.argsort(scores)[::-1][:top_k]
        # Drop documents with no keyword overlap so that arbitrary
        # zero-score documents earn no credit in the fusion step
        return [int(i) for i in top_indices if scores[i] > 0]

    def hybrid_search(self, query, top_k=10, rrf_k=60):
        """Combine dense and sparse results with RRF."""
        dense_results = self.dense_search(query, top_k=50)
        sparse_results = self.sparse_search(query, top_k=50)

        # Apply Reciprocal Rank Fusion
        fused = reciprocal_rank_fusion(
            [dense_results, sparse_results], k=rrf_k
        )

        # Return top_k document indices and scores
        return [(doc_id, score) for doc_id, score in fused[:top_k]]

This implementation is deliberately minimal. A production system would add batched encoding, approximate nearest neighbor search (FAISS or Annoy for the dense index), and more sophisticated tokenization for BM25. But the architecture, two parallel retrieval paths merged by RRF, is the same whether you are running on a laptop or serving millions of queries.

Reranking: The Second Pass

Hybrid search improves which documents enter the candidate pool. Reranking improves how those candidates are ordered. The two-stage pipeline, cheap retrieval followed by expensive reranking, is one of the most effective patterns in modern information retrieval.

The intuition is economic. Dense and sparse retrievers are fast because they process the query and each document independently. The dense retriever encodes the query once, then computes similarity against pre-computed document embeddings. BM25 does a lookup in an inverted index. Both can search millions of documents in milliseconds.

Cross-encoder rerankers are different. They take the query and a candidate document as a single input and process them jointly, allowing deep interaction between query and document tokens through the transformer's attention mechanism. This joint processing captures relevance signals that independent encoding simply cannot, like whether a document actually answers the question rather than merely discussing the same topic.

The cost is that cross-encoders must process each (query, document) pair separately. You cannot pre-compute document representations because the representation depends on the query. Running a cross-encoder against a million documents would take minutes or hours. Running it against 50 candidates takes a fraction of a second.
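A back-of-the-envelope sketch shows why the candidate count matters so much. The 5 ms per (query, document) pair used below is an assumed figure chosen for illustration, not a measurement; real throughput depends on the model, batch size, document length, and hardware.

```python
def rerank_seconds(num_candidates, ms_per_pair=5.0):
    """Estimated cross-encoder time, assuming a fixed cost per pair.

    ms_per_pair is a hypothetical figure; measure it on your own setup.
    """
    return num_candidates * ms_per_pair / 1000.0


for n in (50, 1_000_000):
    seconds = rerank_seconds(n)
    print(f"{n:>9,} candidates -> {seconds:,.2f} s ({seconds / 60:,.1f} min)")
```

Under that assumption, 50 candidates cost a quarter of a second while a million documents cost well over an hour, and the gap scales linearly with whatever per-pair cost you actually measure.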

Nogueira and Cho (2019) demonstrated that cross-encoder reranking, built on BERT, dramatically improved retrieval quality over first-stage retrievers on standard benchmarks. The approach has since become standard practice, with models like Cohere Rerank, BGE-reranker, and various fine-tuned cross-encoders available off the shelf.

The Two-Stage Pipeline

The pattern works as follows. The first stage (hybrid search) casts a wide net, retrieving 50 to 100 candidate chunks. Recall is the priority here. You want to make sure the relevant documents are somewhere in the candidate set, even if the ranking is imperfect.

The second stage (reranking) refines the ordering, scoring each candidate against the query with a cross-encoder and keeping only the top 5 to 10. Precision is the priority here. The context window you send to the LLM is limited, and every irrelevant chunk displaces a useful one.

Stage 1 maximizes recall over the full corpus; stage 2 spends compute only on the candidates that survived. This ordering is what makes the pipeline tractable: running the expensive model first, over the whole corpus, would be prohibitively slow.

from sentence_transformers import CrossEncoder

class RerankedRetriever:
    """Two-stage retrieval: hybrid search + cross-encoder reranking."""

    def __init__(self, hybrid_retriever):
        self.retriever = hybrid_retriever
        self.reranker = CrossEncoder(
            "cross-encoder/ms-marco-MiniLM-L-6-v2"
        )

    def search(self, query, first_stage_k=50, final_k=5):
        """Retrieve candidates, then rerank."""

        # Stage 1: Cast a wide net with hybrid search
        candidates = self.retriever.hybrid_search(
            query, top_k=first_stage_k
        )
        candidate_indices = [doc_id for doc_id, _ in candidates]

        # Prepare (query, document) pairs for the cross-encoder
        pairs = [
            (query, self.retriever.documents[idx])
            for idx in candidate_indices
        ]

        # Stage 2: Rerank with cross-encoder
        rerank_scores = self.reranker.predict(pairs)

        # Sort by reranker score and return top results
        scored = list(zip(candidate_indices, rerank_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        return [
            (idx, score, self.retriever.documents[idx])
            for idx, score in scored[:final_k]
        ]

The performance gain from reranking is often substantial. In practice, teams report 10-25% improvements in retrieval relevance metrics after adding a reranker to their pipeline. The reason is intuitive: a bi-encoder (used in dense retrieval) compresses each text into a single vector before comparison, which inevitably loses nuance. A cross-encoder sees both texts simultaneously, allowing it to recognize subtle relevance signals like whether a passage actually contains the answer or merely discusses adjacent concepts.

The latency cost is real but manageable. Reranking 50 candidates with a small cross-encoder typically adds 50-200 milliseconds, depending on document length and hardware. For most RAG applications, that tradeoff is worthwhile.

Query Transformation: Making the Question Better

Sometimes the problem is not the retriever but the query. Users ask vague questions, use ambiguous terms, or frame their information need in a way that does not align with how the knowledge base is written. Query transformation techniques address this by rewriting or augmenting the query before it reaches the retriever.

HyDE: Hypothetical Document Embeddings

HyDE, introduced by Gao et al. (2022), is one of the more creative approaches to query transformation. The idea is counterintuitive: instead of searching with the user's query directly, first ask an LLM to generate a hypothetical answer, then use that hypothetical answer as the search query.

Why would searching with a fabricated answer work better than searching with the original question? Because documents in your knowledge base are written in the style of answers, not questions. A user asks "What causes battery degradation in lithium-ion cells?" but the relevant document reads "Lithium-ion battery capacity loss results from several mechanisms including SEI layer growth, lithium plating, and cathode structural degradation." The hypothetical answer, even if factually imperfect, is stylistically closer to the target document than the question was.

from openai import OpenAI

def hyde_search(query, retriever, llm_client):
    """Generate a hypothetical answer, then search with it."""

    # Step 1: Generate a hypothetical document
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Write a short paragraph that would answer "
                f"this question:\n{query}\n\n"
                f"Write as if you are a technical document. "
                f"Be specific and factual."
            )
        }],
        max_tokens=150
    )
    hypothetical_doc = response.choices[0].message.content

    # Step 2: Search using the hypothetical document
    # as the query (dense search benefits most from this)
    return retriever.dense_search(hypothetical_doc)

HyDE works best when the embedding model struggles with question-to-document matching but performs well on document-to-document matching. It adds one LLM call of latency, which may or may not be acceptable depending on your response time requirements. It also inherits the LLM's biases, so the hypothetical answer may steer retrieval in unintended directions for ambiguous queries.

Multi-Query Retrieval

A different approach to query transformation is to generate multiple reformulations of the original query and run each through the retriever independently. The union of results across all query variants often achieves better recall than any single formulation.

def multi_query_search(query, retriever, llm_client, n_queries=3):
    """Generate query variants and merge results."""

    # Generate alternative phrasings
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n_queries} different ways to ask "
                f"this question. Return one per line, no "
                f"numbering:\n{query}"
            )
        }]
    )
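    # Split into lines and drop blanks the model may emit between variants
    variants = [
        line.strip()
        for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]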

    # Search with each variant
    all_ranked_lists = []
    for variant in [query] + variants:
        results = retriever.hybrid_search(variant, top_k=20)
        all_ranked_lists.append([doc_id for doc_id, _ in results])

    # Fuse all result lists
    return reciprocal_rank_fusion(all_ranked_lists)

This works because different phrasings activate different parts of the index. The original query "impact of sleep deprivation on memory" might miss a document titled "Cognitive Effects of Insufficient Rest," but a variant like "how does lack of sleep affect cognitive function" might catch it. The more angles you search from, the fewer relevant documents slip through.

The cost is linear in the number of query variants: more retriever calls, more latency. In practice, three to five variants represent a reasonable tradeoff between recall improvement and added latency.

Query Expansion

A simpler form of query transformation is query expansion, where you append related terms to the original query without generating full reformulations. This can be as straightforward as adding synonyms or as sophisticated as using a language model to identify related concepts.

For sparse retrieval specifically, query expansion is powerful because it directly addresses the vocabulary mismatch problem. Expanding "CEO compensation" with "chief executive officer salary remuneration pay" ensures that BM25 can match documents using any of those terms. The technique has roots in classical IR research going back to the 1960s, with relevance feedback and thesaurus-based expansion predating modern language models by decades.
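The simplest version of this is a thesaurus lookup, sketched below. The `SYNONYMS` table is a tiny hand-built stand-in invented for illustration; in practice it might come from a domain glossary or a language model.

```python
# Hypothetical hand-built thesaurus; keys and values are lowercase tokens.
SYNONYMS = {
    "ceo": ["chief", "executive", "officer"],
    "compensation": ["salary", "remuneration", "pay"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query tokens followed by any related terms, without duplicates."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for extra in synonyms.get(token, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


print(expand_query("CEO compensation"))
```

The expanded token list can be passed to a BM25 index in place of the original tokenized query. Because every added term widens the match set, aggressive expansion tends to raise recall at some cost to precision.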

Step-Back Prompting

Step-back prompting, introduced by Zheng et al. (2023) at Google DeepMind, takes a different approach to query transformation. Instead of rephrasing the original question or generating a hypothetical answer, you first ask the LLM to produce a more general "step-back" question that abstracts away from the specifics. You then search with both the original query and the step-back query, merging the results.

The motivation is straightforward. Highly specific questions often use narrow terminology that matches only a small slice of the relevant literature. The step-back question broadens the search surface without abandoning the original specificity, because both queries contribute to the final result set.

Consider a user who asks: "What is the degradation rate of lithium iron phosphate batteries at 45 degrees Celsius?" This is a precise question, and the retriever may find only documents that mention that exact chemistry and temperature. A step-back version of the question might be: "What factors affect lithium-ion battery degradation?" That broader query retrieves foundational documents about battery chemistry, thermal effects, and degradation mechanisms that the original query would miss entirely. The union of both result sets gives the LLM both the specific data point and the surrounding context needed to produce a thorough answer.

def step_back_search(query, retriever, llm_client):
    """Generate a step-back question, then search with both."""

    # Step 1: Ask the LLM to generate a broader question
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Given the following question, generate a more "
                f"general step-back question that captures the "
                f"broader topic or principle behind it.\n\n"
                f"Original question: {query}\n\n"
                f"Step-back question:"
            )
        }],
        max_tokens=100
    )
    step_back_query = response.choices[0].message.content.strip()

    # Step 2: Search with both the original and step-back queries
    original_results = retriever.hybrid_search(query, top_k=30)
    step_back_results = retriever.hybrid_search(step_back_query, top_k=30)

    # Step 3: Merge results using RRF
    return reciprocal_rank_fusion([
        [doc_id for doc_id, _ in original_results],
        [doc_id for doc_id, _ in step_back_results],
    ])

Like HyDE, step-back prompting adds one LLM call of latency to the retrieval pipeline. However, the step-back question is simpler to generate than a full hypothetical document, so the added latency is typically smaller. The LLM only needs to abstract the question, not fabricate a plausible answer.

Step-back prompting works particularly well for highly technical or domain-specific queries where the user's terminology may not match the vocabulary used in the knowledge base. A researcher asking about a specific protein interaction pathway benefits from also retrieving general documents about the protein family and the interaction mechanism. A developer asking about a particular Kubernetes error benefits from also retrieving documents about the broader subsystem involved. The step-back question acts as a bridge between the user's precise framing and the corpus's broader coverage.

Evaluation: Measuring What Matters

Building a retrieval pipeline without evaluation is like tuning an engine without a dynamometer. You might get lucky. You probably will not. Rigorous evaluation requires three things: a set of queries, a set of relevant documents for each query (ground truth), and metrics that capture what you care about.

Building Ground Truth

The ground truth is the hardest part. For each query in your evaluation set, you need to know which documents in your corpus are relevant. There is no shortcut here. Someone, either a domain expert or a carefully prompted LLM with human verification, must label the relevance judgments.

A practical approach for RAG systems is to start from real user queries. Collect 50 to 100 representative questions from your application logs. For each question, have an annotator identify the 3 to 10 chunks in your knowledge base that contain the answer. This gives you a retrieval evaluation set that reflects actual usage patterns rather than synthetic benchmarks.

# Example evaluation set structure
eval_set = [
    {
        "query": "What is the maximum payload for the XR-7500?",
        "relevant_doc_ids": [42, 43, 107],
        "notes": "Answer in datasheet (42) and spec summary (43, 107)"
    },
    {
        "query": "How do I configure TLS mutual authentication?",
        "relevant_doc_ids": [215, 216, 220],
        "notes": "Setup guide (215-216) and troubleshooting (220)"
    },
    # ... 50-100 more queries
]

Key Metrics

Recall@K measures the proportion of relevant documents that appear in the top K results. If there are 5 relevant documents and 3 appear in the top 10, Recall@10 is 0.6. This is arguably the most important metric for RAG because it tells you whether the retriever is finding the information the LLM needs. A retriever with Recall@10 of 0.95 gives the LLM a fighting chance. One with Recall@10 of 0.4 does not.

MRR (Mean Reciprocal Rank) measures how quickly the first relevant document appears. If the first relevant result is at position 3, the reciprocal rank is 1/3. Average this across all queries and you get MRR. High MRR means relevant results appear near the top, which matters when your context window is limited and position affects LLM attention.

NDCG (Normalized Discounted Cumulative Gain) is the most sophisticated of the three. It accounts for the relevance grade of each document (not just binary relevant/irrelevant) and applies a logarithmic discount based on position. A highly relevant document at position 1 contributes more than a marginally relevant document at position 5. NDCG is especially useful when your relevance judgments have multiple levels, such as "perfect answer," "partially relevant," and "tangentially related."

import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Proportion of relevant docs found in top-k results."""
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        # No labeled relevant docs: recall is undefined, so report 0.0
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document."""
    relevant_set = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids, relevance_scores, k):
    """NDCG with graded relevance judgments.

    relevance_scores: dict mapping doc_id -> relevance grade
    """
    # DCG for the retrieved ranking
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        rel = relevance_scores.get(doc_id, 0)
        dcg += (2 ** rel - 1) / np.log2(i + 2)

    # Ideal DCG (perfect ranking)
    ideal_rels = sorted(
        relevance_scores.values(), reverse=True
    )[:k]
    idcg = sum(
        (2 ** rel - 1) / np.log2(i + 2)
        for i, rel in enumerate(ideal_rels)
    )

    return dcg / idcg if idcg > 0 else 0.0

def evaluate_retriever(retriever, eval_set, k=10):
    """Run full evaluation across a test set."""
    metrics = {"recall": [], "mrr": []}

    for example in eval_set:
        results = retriever.hybrid_search(
            example["query"], top_k=k
        )
        retrieved_ids = [doc_id for doc_id, _ in results]

        metrics["recall"].append(
            recall_at_k(retrieved_ids, example["relevant_doc_ids"], k)
        )
        metrics["mrr"].append(
            mrr(retrieved_ids, example["relevant_doc_ids"])
        )

    return {
        "mean_recall@k": np.mean(metrics["recall"]),
        "mean_mrr": np.mean(metrics["mrr"]),
    }
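
As a quick sanity check, the numbers quoted in the metric definitions can be reproduced by hand. The sketch below repeats compact versions of the recall and reciprocal-rank functions so that it runs on its own; the document IDs, the ranking, and the relevance grades are made up for illustration.

```python
import math

def recall_at_k(retrieved_ids, relevant_ids, k):
    relevant = set(relevant_ids)
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved_ids, relevant_ids):
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def dcg(grades):
    # Same gain and discount as ndcg_at_k: (2^rel - 1) / log2(position + 1)
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

# Hypothetical ranking: 10 results returned, 5 relevant docs exist (IDs 1-5)
retrieved = [90, 91, 1, 92, 2, 93, 94, 3, 95, 96]
relevant = [1, 2, 3, 4, 5]

recall = recall_at_k(retrieved, relevant, k=10)  # 3 of the 5 relevant docs found
rr = reciprocal_rank(retrieved, relevant)        # first relevant doc sits at rank 3

# Hypothetical graded judgments: 2 = perfect answer, 1 = partially relevant
grades_by_doc = {1: 2, 2: 1, 3: 1, 4: 2, 5: 1}
retrieved_grades = [grades_by_doc.get(d, 0) for d in retrieved]
ideal_grades = sorted(grades_by_doc.values(), reverse=True)[:10]
ndcg = dcg(retrieved_grades) / dcg(ideal_grades)

print(f"Recall@10={recall:.2f}  RR={rr:.3f}  NDCG@10={ndcg:.3f}")
```

Recall@10 comes out at 0.6 and the reciprocal rank at 1/3, matching the worked numbers in the definitions. NDCG lands well below 1 here because two of the relevant documents were never retrieved and the ones that were sit low in the ranking.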

The Evaluation Loop

With ground truth and metrics in place, evaluation becomes systematic. Change a parameter, such as the chunk size, the embedding model, or the BM25 weight in hybrid search, then rerun the evaluation set and compare metrics. This is how you make principled decisions about your retrieval pipeline rather than relying on vibes and spot checks.
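
One way to make that loop concrete is a small comparison harness. The sketch below is illustrative only: `compare_configs`, the config labels, and the canned search results are hypothetical stand-ins for real pipeline variants, and each variant is assumed to expose a function that takes a query and `top_k` and returns a ranked list of document IDs.

```python
from statistics import mean

def recall_at_k(retrieved_ids, relevant_ids, k):
    relevant = set(relevant_ids)
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def compare_configs(configs, eval_set, k=10):
    """configs: dict mapping a label -> search(query, top_k) -> ranked doc ids."""
    report = {}
    for label, search in configs.items():
        scores = [
            recall_at_k(search(ex["query"], k), ex["relevant_doc_ids"], k)
            for ex in eval_set
        ]
        report[label] = mean(scores)
    return report

# Toy evaluation set and canned results standing in for two chunk sizes
eval_set = [
    {"query": "q1", "relevant_doc_ids": [1, 2]},
    {"query": "q2", "relevant_doc_ids": [3]},
]
canned = {
    "chunk_256": {"q1": [1, 2, 9], "q2": [3, 8, 7]},
    "chunk_512": {"q1": [1, 9, 8], "q2": [7, 8, 6]},
}
configs = {
    label: (lambda query, top_k, hits=hits: hits[query][:top_k])
    for label, hits in canned.items()
}

report = compare_configs(configs, eval_set, k=3)
for label, score in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{label}: mean recall@3 = {score:.2f}")
```

Holding the evaluation set fixed and changing one parameter at a time is what makes the comparison meaningful; changing two things at once leaves you unable to attribute the difference.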

One critical point: always evaluate retrieval independently from generation. If you only measure end-to-end answer quality, you cannot tell whether a regression was caused by worse retrieval, a bad prompt template, or a model API change. Separating the evaluation layers lets you diagnose problems precisely. When retrieval metrics drop, fix the retriever. When generation quality drops despite good retrieval, fix the prompt or the model.

Synthetic Evaluation Data

Building ground truth by hand is expensive, so many teams bootstrap their evaluation sets using LLMs. The approach works like this: take a chunk from your knowledge base, prompt a language model to generate a question that the chunk would answer, and use that (question, chunk) pair as a labeled example. This is fast and scalable, but the synthetic questions tend to be easier than real user queries because the LLM has seen the answer before generating the question.
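
A minimal sketch of that bootstrapping step follows. The `generate_question` argument is an assumption: it stands for whatever wrapper you write around your LLM provider, and a placeholder is used here so the example runs without an API call. The output mirrors the evaluation set structure shown earlier.

```python
def build_synthetic_eval_set(chunks, generate_question):
    """chunks: dict doc_id -> chunk text.
    generate_question: callable(text) -> str, assumed to wrap an LLM call."""
    eval_set = []
    for doc_id, text in chunks.items():
        question = generate_question(text).strip()
        if not question:
            continue  # skip chunks the generator returned nothing for
        eval_set.append({
            "query": question,
            "relevant_doc_ids": [doc_id],
            "notes": "synthetic",  # tag so these can be replaced by real queries later
        })
    return eval_set

# Placeholder generator (hypothetical); a real one would prompt a language model
def fake_generate_question(text):
    return f"What does the documentation say about {text.split()[0].lower()}?"

chunks = {
    42: "Payload limits are listed in the datasheet.",
    215: "TLS mutual authentication requires a client certificate.",
}
synthetic = build_synthetic_eval_set(chunks, fake_generate_question)
for example in synthetic:
    print(example)
```

Tagging each example as synthetic makes it easy to report metrics separately for synthetic and expert-annotated queries once both exist.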

A reasonable workflow is to start with synthetic data to get your pipeline off the ground, then gradually replace synthetic examples with real queries annotated by domain experts. Even 30 to 50 expert-annotated queries provide far more signal than 500 synthetic ones, because they reflect the actual distribution of user information needs, including the ambiguous, poorly phrased, and genuinely hard questions that synthetic generation tends to miss.

Continuous Evaluation

Retrieval quality is not something you measure once and forget. Your knowledge base changes as documents are added, updated, and removed. Your query distribution shifts as users discover new features or encounter new problems. An embedding model that performed well six months ago may be outpaced by newer alternatives.

Integrate retrieval evaluation into your CI/CD pipeline. Run your evaluation set against every significant change to the retrieval stack, whether that is a new embedding model, a modified chunking strategy, or an updated document corpus. Track metrics over time. Regressions caught in testing are cheap to fix; regressions discovered by users are expensive in every sense.
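
A regression gate can be as small as a function that compares the current run against a stored baseline. The sketch below is one possible shape, not a prescribed one: `check_regressions`, the tolerance of 0.02, and the metric values are all hypothetical, and the metric names simply follow the dictionary returned by `evaluate_retriever`.

```python
def check_regressions(current, baseline, tolerance=0.02):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current run")
        elif value < base_value - tolerance:
            failures.append(
                f"{name}: {value:.3f} is below baseline {base_value:.3f} "
                f"by more than {tolerance}"
            )
    return failures

# Hypothetical numbers: the baseline would normally be loaded from a file in the repo
baseline = {"mean_recall@k": 0.91, "mean_mrr": 0.74}
current = {"mean_recall@k": 0.86, "mean_mrr": 0.75}

failures = check_regressions(current, baseline)
for failure in failures:
    print("REGRESSION:", failure)
```

In a CI job, a test would typically end with `assert not failures` so that a drop beyond the tolerance fails the build. The tolerance exists because small evaluation sets are noisy; with 50 queries, a single query flipping moves mean recall noticeably.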

Putting It All Together

A production RAG retrieval pipeline typically looks something like this:

1. Query transformation (optional): Expand or rewrite the query using HyDE, multi-query, step-back prompting, or simple synonym expansion.
2. Hybrid retrieval: Run the (possibly transformed) query through both dense and sparse retrievers. Merge results with RRF to get 50-100 candidates.
3. Reranking: Score each candidate with a cross-encoder. Keep the top 5-10.
4. Context assembly: Format the top chunks into the LLM's context window, ordered by relevance score.

The pipeline is modular, and not every system needs every stage. Skip stage 1 when queries are unambiguous; skip stage 3 for low-stakes uses. A simple internal knowledge base might work well with just hybrid search and no reranking. A high-stakes medical or legal system might add query expansion, reranking, and even a further verification pass. Add complexity where the evaluation metrics tell you it is needed.

The key insight is that retrieval quality is not a single problem but a pipeline of problems, each with its own failure modes and solutions. Dense search fails on specific terms, so you add sparse search; first-stage retrieval is imprecise, so you add reranking; user queries are ambiguous, so you add query transformation. Each stage compensates for the weaknesses of the one before it.
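
The modularity described above can be sketched as a function whose optional stages are simply omitted. Everything here is illustrative: `run_pipeline` and the toy stage functions are hypothetical, and real stages would call an actual retriever and cross-encoder.

```python
from typing import Callable, List, Optional, Tuple

Scored = List[Tuple[str, float]]  # (doc_id, score) pairs, best first

def run_pipeline(
    query: str,
    retrieve: Callable[[str, int], Scored],
    transform: Optional[Callable[[str], str]] = None,
    rerank: Optional[Callable[[str, Scored], Scored]] = None,
    candidates: int = 50,
    keep: int = 5,
) -> List[str]:
    """Run the stages in order; optional stages are skipped when None."""
    if transform is not None:
        query = transform(query)        # query transformation
    hits = retrieve(query, candidates)  # first-stage (hybrid) retrieval
    if rerank is not None:
        hits = rerank(query, hits)      # reranking
    return [doc_id for doc_id, _ in hits[:keep]]  # IDs for context assembly

# Toy stand-ins so the sketch runs without any external service
def toy_retrieve(query, top_k):
    return [("a", 0.9), ("b", 0.8), ("c", 0.7)][:top_k]

def toy_rerank(query, hits):
    # Pretend the reranker prefers "c"; a real one would score (query, chunk) pairs
    return sorted(hits, key=lambda hit: hit[0], reverse=True)

simple = run_pipeline("q", toy_retrieve, keep=2)
full = run_pipeline("Q", toy_retrieve, transform=str.lower, rerank=toy_rerank, keep=2)
print(simple, full)
```

Because each stage is passed in as a plain function, you can evaluate the pipeline with and without a given stage on the same evaluation set and keep only the stages that earn their latency.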

The Chunk Size Question

Before any retrieval can happen, documents must be split into chunks. The choice of chunk size ripples through every stage of the pipeline, and there is no universally correct answer.

Small chunks (100-200 tokens) produce precise embeddings because the vector represents a focused passage. Retrieval tends to be more accurate for specific factual queries. But small chunks lose context. A passage explaining a concept across several paragraphs gets fragmented, and the individual fragments may not make sense in isolation. When the LLM receives these fragments, it may struggle to reconstruct the full picture.

Large chunks (500-1000 tokens) preserve more context per passage, which is helpful for complex topics that cannot be captured in a single paragraph. But larger chunks produce less precise embeddings because the vector must represent a broader range of content. A chunk that discusses both pricing and technical specifications will match queries about either topic, reducing precision for both.

A common middle ground is chunks of 256 to 512 tokens with 50-100 tokens of overlap between consecutive chunks. The overlap ensures that information at chunk boundaries is not lost. Some teams use hierarchical chunking, storing both a summary embedding for large sections and detailed embeddings for smaller passages within those sections, searching at both levels simultaneously.
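
Fixed-size chunking with overlap is a sliding window over the token sequence. The sketch below assumes the document has already been tokenized into a list; `chunk_tokens` is a hypothetical helper, and in practice you would count tokens with the tokenizer of your embedding model rather than the placeholder tokens used here.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(600)]  # placeholder tokens
chunks = chunk_tokens(tokens, chunk_size=256, overlap=50)

print([len(c) for c in chunks])
# The last `overlap` tokens of one chunk open the next one
print(chunks[0][-50:] == chunks[1][:50])
```

The early `break` avoids emitting a trailing chunk that is entirely contained in the previous one, which would otherwise add a near-duplicate vector to the index.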

The right chunk size is empirical. Measure it. Build your evaluation set, try three or four chunk sizes, and let the metrics decide. The optimal size depends on the nature of your documents, the types of queries you receive, and the capacity of your context window. What works for a FAQ database will not work for a collection of research papers.

Common Failure Modes

Even well-designed retrieval pipelines fail in predictable ways. Recognizing these patterns is the first step toward fixing them.

Stale embeddings. You update a document in your knowledge base but forget to re-embed it. The vector index still contains the old embedding, which may not match queries about the updated content. The fix is operationally trivial, but the omission causes real production incidents. Every document update pipeline needs an embedding refresh step.
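
One way to catch stale embeddings, sketched under the assumption that you store a content hash next to each vector at embed time, is to compare that hash with the current document text. `find_stale` and the metadata layout are hypothetical.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(documents, indexed_hashes):
    """documents: doc_id -> current text.
    indexed_hashes: doc_id -> hash recorded when the document was last embedded."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )

documents = {"a": "v1 text", "b": "v2 text (edited)", "c": "new doc"}
indexed_hashes = {
    "a": content_hash("v1 text"),
    "b": content_hash("v1 text of b"),  # embedded before the edit
}

stale = find_stale(documents, indexed_hashes)
print(stale)  # documents that need (re-)embedding
```

Documents that were edited and documents that were never embedded both show up, so the same check covers updates and additions.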

Chunk boundary artifacts. The answer to a query spans two chunks, but only one is retrieved. The LLM receives half the answer and either confabulates the rest or produces something misleading. Overlapping chunks and parent-document retrieval (where you retrieve the surrounding context of a matching chunk) help mitigate this.
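
A simple variant of this idea is neighbor expansion: for every retrieved chunk, also pull in the chunks immediately before and after it from the same document. The sketch below assumes chunks are addressed as `(doc_id, chunk_index)` pairs; that addressing scheme and the `expand_with_neighbors` helper are illustrative, not a specific library's API.

```python
def expand_with_neighbors(hits, chunks_per_doc, window=1):
    """hits: list of (doc_id, chunk_index), best first.
    chunks_per_doc: doc_id -> number of chunks in that document."""
    expanded, seen = [], set()
    for doc_id, idx in hits:
        lo = max(0, idx - window)
        hi = min(chunks_per_doc[doc_id] - 1, idx + window)
        for neighbor in range(lo, hi + 1):
            key = (doc_id, neighbor)
            if key not in seen:  # adjacent hits share neighbors; keep each once
                seen.add(key)
                expanded.append(key)
    return expanded

hits = [("guide", 4), ("guide", 5), ("faq", 0)]
expanded = expand_with_neighbors(hits, {"guide": 10, "faq": 2})
print(expanded)
```

The cost is a larger context: with `window=1`, each hit can contribute up to three chunks, so the number of retrieved hits usually needs to shrink to stay within the context budget.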

Embedding model drift. You upgrade your embedding model for better quality but forget that the new model produces vectors in a different space. Your existing index, built with the old model, is now incompatible. You must re-embed the entire corpus when changing embedding models, which is obvious in principle but easy to overlook in practice.
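
A cheap guard is to record which model built the index and refuse to query it with a different one. This is a sketch under assumptions: the `index_meta` dictionary, its keys, and the model names `embed-v1` and `embed-v2` are hypothetical, and your vector store may offer its own place to keep such metadata.

```python
class IndexModelMismatch(RuntimeError):
    """Raised when the query encoder differs from the one that built the index."""

def check_index_compatibility(index_meta, model_name, dimension):
    """index_meta: dict written next to the index at build time (assumed layout)."""
    built_with = index_meta.get("embedding_model")
    if built_with != model_name:
        raise IndexModelMismatch(
            f"index built with {built_with!r}, queries use {model_name!r}; "
            "re-embed the corpus before serving"
        )
    if index_meta.get("dimension") != dimension:
        raise IndexModelMismatch("vector dimension differs from the index")

index_meta = {"embedding_model": "embed-v1", "dimension": 768}
check_index_compatibility(index_meta, "embed-v1", 768)  # passes silently

try:
    check_index_compatibility(index_meta, "embed-v2", 768)
    mismatch_detected = False
except IndexModelMismatch as err:
    mismatch_detected = True
    print(err)
```

The model name is checked before the dimension because two models can share a dimension and still produce vectors in unrelated spaces; a matching dimension alone proves nothing.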

Score threshold traps. Some systems filter results by a minimum similarity score. This seems reasonable but fails for queries that are genuinely difficult, returning nothing when returning the best available (even if imperfect) result would be more useful. Prefer rank-based cutoffs (top K) over score-based cutoffs in most cases.
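
The difference is easy to see on a hard query where every candidate scores low. In this sketch the scores and the 0.5 threshold are made-up values, and both selection helpers are hypothetical.

```python
def select_by_threshold(scored, min_score):
    """Score-based cutoff: keep only results at or above min_score."""
    return [(doc_id, s) for doc_id, s in scored if s >= min_score]

def select_top_k(scored, k):
    """Rank-based cutoff: keep the k best results regardless of absolute score."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical similarity scores for a genuinely difficult query
scored = [("b", 0.38), ("a", 0.41), ("c", 0.12)]

by_threshold = select_by_threshold(scored, min_score=0.5)
by_rank = select_top_k(scored, k=2)

print(by_threshold)  # nothing survives the threshold
print(by_rank)       # the best available candidates are still returned
```

Absolute similarity scores are also not comparable across embedding models, so a threshold tuned for one model tends to need retuning after a model change, while a top-K cutoff does not.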

The language model at the end of this pipeline is only as good as the context it receives. You can swap in a more powerful model, refine your prompt template, and tune your generation parameters, but none of that matters if the retriever is surfacing the wrong documents. Fix retrieval first. Everything downstream improves.

. . .

References

Textbook grounding and extended commentary: Sources.

Robertson, S. & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 3(4), 333-389.
Gao, L., Ma, X., Lin, J., & Callan, J. (2022). "Precise Zero-Shot Dense Retrieval without Relevance Labels." arXiv:2212.10496.
Nogueira, R. & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.
Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009.
Zheng, H. S., et al. (2023). "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." arXiv:2310.06117.
