Re-ranking: The Second Chance
First-pass retrieval optimizes for recall, not precision. The top fifty documents from a BM25 or vector retriever are a candidate set, not a final ranking. The reranker is the second-stage scorer that reads each query-document pair together and produces a relevance signal the first pass could not afford to compute. The two-stage pattern is the floor of every production RAG stack in 2026, not an optional addition.
Why Retrieval Needs a Second Pass
First-pass retrievers are tuned for recall. They have to scan the whole corpus and return a candidate set within a tight latency budget. They cannot afford to deeply analyze every query-document pair, so they use scoring shortcuts: term overlap weighted by IDF (BM25), or cosine similarity between independently-encoded vectors (dense retrieval). These shortcuts are computationally cheap, which is why they work over millions of documents. They are also coarse, which is why a top-50 result list typically contains some documents that are not actually relevant.
The standard production response is to add a second stage. Take the top N candidates from the first pass, where N is typically 50 to 200. Score each candidate with a slower, more accurate model. Sort by the new scores. Return the top K, where K is typically 10. The reranker's job is to fix the ordering, not expand the set. If the right document is not in the top-N from the first pass, no amount of reranking will recover it. Recall is the first pass's responsibility; precision is the reranker's.
The numbers that matter for sizing N and K come from the eval discipline covered in the measuring retrieval article. Recall@N tells you what fraction of relevant documents the first pass surfaces in its top N. If Recall@100 is 95% on your judged eval set, you can rerank confident that the relevant document is almost always present in the candidate set. If Recall@100 is 60%, fix the first pass before reaching for a reranker.
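As a rough illustration of the measurement, the sketch below computes Recall@N over a small judged set. The function names and the dictionary shapes for `runs` and `judged` are assumptions made for this example, not part of any particular eval library.

```python
def recall_at_n(retrieved, relevant, n):
    """Fraction of the judged-relevant documents that appear in the top n results."""
    if not relevant:
        raise ValueError("query has no judged-relevant documents")
    top_n = set(retrieved[:n])
    return len(top_n & relevant) / len(relevant)


def mean_recall_at_n(runs, judged, n):
    """Average Recall@N over every judged query.

    runs:   {query_id: [doc_id, ...]}  ranked first-pass output (hypothetical shape)
    judged: {query_id: {doc_id, ...}}  relevant documents per query
    """
    scores = [recall_at_n(runs.get(qid, []), rel, n) for qid, rel in judged.items()]
    return sum(scores) / len(scores)


# Toy data: two judged queries and the first pass's ranked output for each.
judged = {"q1": {"d1", "d7"}, "q2": {"d3"}}
runs = {"q1": ["d1", "d2", "d7"], "q2": ["d9", "d8", "d3"]}

print(mean_recall_at_n(runs, judged, n=2))  # 0.25: q1 finds 1 of 2, q2 finds 0 of 1
print(mean_recall_at_n(runs, judged, n=3))  # 1.0: both queries fully covered
```

Sweeping `n` over a few values on a real judged set shows where recall flattens out, which is a reasonable starting point for choosing N.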
Bi-Encoders versus Cross-Encoders
The architectural distinction is the one thing to understand cold. It is what makes two-stage retrieval possible at all.
A bi-encoder encodes the query and the document independently. Each becomes a vector in the same embedding space. Relevance is the cosine similarity (or dot product) between the two vectors. The document vectors can be precomputed at index time, so at query time the system needs to compute only the query vector and a similarity score against each candidate. Sub-linear search via HNSW or IVF makes the candidate-set step efficient even at corpus sizes of hundreds of millions of documents.
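A minimal sketch of bi-encoder scoring, assuming the embeddings already exist. The three-dimensional vectors are made-up stand-ins for real model output, and the exhaustive loop in `search` is what an HNSW or IVF index would replace in practice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Index time: document vectors are computed once and stored.
# These values are illustrative, not the output of any real encoder.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.7, 0.6, 0.1],
}


def search(query_vector, k):
    """Query time: compare the single query vector against the stored vectors."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


top = search([1.0, 0.2, 0.0], k=2)
print([doc_id for doc_id, _ in top])  # ['doc_a', 'doc_c']
```

Note that the query never touches the document text here, only the stored vectors, which is what makes precomputation possible.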
A cross-encoder feeds the query and the document into the model as a single sequence:
[CLS] query tokens [SEP] document tokens [SEP]
The full transformer attention sees both at once. Every query token can attend to every document token. The model outputs a single relevance score for the pair, typically from the projected [CLS] hidden state. Nothing can be precomputed: the score depends on the joint sequence, which is unique per query-document pair. At query time, the cross-encoder must run one forward pass per candidate.
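To make the "one forward pass per candidate" point concrete, here is a small sketch. `toy_forward_pass` is a word-overlap stand-in for a transformer, not a real model, and both function names are hypothetical; the point is only that every candidate produces its own joint sequence and its own scoring call.

```python
def build_pair_input(query, document):
    """Lay out the joint sequence in the [CLS] ... [SEP] ... [SEP] format."""
    return f"[CLS] {query} [SEP] {document} [SEP]"


def toy_forward_pass(sequence):
    """Stand-in for a transformer forward pass (NOT a real model).

    Returns the fraction of query words that also occur in the document part.
    """
    body = sequence[len("[CLS] "):]
    query_part, doc_part = body.split(" [SEP] ", 1)
    doc_part = doc_part[: -len(" [SEP]")]
    query_words = set(query_part.lower().split())
    doc_words = set(doc_part.lower().split())
    return len(query_words & doc_words) / len(query_words)


def score_candidates(query, candidates, forward_pass):
    """Build one joint sequence and run one forward pass per candidate."""
    return [forward_pass(build_pair_input(query, doc)) for doc in candidates]


candidates = ["freeing stuck pipe with spotting fluid", "mud weight selection"]
scores = score_candidates("stuck pipe", candidates, toy_forward_pass)
print(scores)  # [1.0, 0.0]
```

Because the input sequence changes whenever the query changes, none of these scores can be stored at index time.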
The asymmetry in compute cost is what enforces the two-stage pattern:
- Bi-encoder per query at query time: one query encoding plus similarity computations against the corpus. An exhaustive scan needs M of them, where M is the corpus size; approximate-nearest-neighbor indexes cut that to a sub-linear number.
- Cross-encoder per query at query time: N forward passes through a full transformer, where N is the candidate-set size. Each pass is several orders of magnitude more expensive than a cosine similarity.
For a corpus of ten million documents and N = 100 candidates, the bi-encoder is the only feasible first pass and the cross-encoder is the only sufficiently accurate second pass. They are complementary, not competing.
The intuition for why cross-encoders are more accurate: a bi-encoder has to compress every document into a single fixed-length vector that captures all the things any future query might ask about. A cross-encoder receives the query alongside the document and can model the interaction directly. The cross-encoder is not smarter; it has access to information the bi-encoder cannot fit through its encoding bottleneck.
The Two-Stage Pattern
The canonical pipeline:
- Query arrives. Optional: query-side transformation (HyDE, decomposition, step-back) runs first, producing one or more retrieval queries.
- First-stage retrieval against the index. Usually a hybrid of BM25 and dense vectors, fused by RRF or a similar primitive. Returns the top N candidates (typically 50 to 200).
- Reranker scores each of the N candidates against the original query. Sorts by relevance score.
- Top K are returned (typically 10) for the LLM to read.
The constants matter. N too small and the reranker has nothing to reorder; N too large and reranker compute becomes the latency bottleneck. K is bounded by the LLM's effective context window for the retrieved chunks, which is smaller than the model's maximum context window because the LLM also needs the user query, system prompt, and reasoning space.
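The pipeline above can be sketched as a single function with pluggable stages. `two_stage_search`, the toy first stage, and the lookup-table reranker are all hypothetical stand-ins; a real system would put a hybrid retriever and a cross-encoder behind the same two callables.

```python
def two_stage_search(query, first_stage, reranker, n=100, k=10):
    """Retrieve n candidates, rescore each against the query, return the best k."""
    candidates = first_stage(query, n)  # recall-oriented and cheap
    scored = [(doc_id, reranker(query, doc_id)) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # precision-oriented reorder
    return [doc_id for doc_id, _ in scored[:k]]


# Toy stand-ins: a fixed first-pass ranking and a table of reranker scores.
FIRST_PASS_ORDER = ["a", "b", "c", "d"]
RERANK_SCORES = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}


def toy_first_stage(query, n):
    return FIRST_PASS_ORDER[:n]


def toy_reranker(query, doc_id):
    return RERANK_SCORES[doc_id]


result = two_stage_search("example query", toy_first_stage, toy_reranker, n=3, k=2)
print(result)  # ['b', 'c']
```

Document `d` would have scored 0.7 and taken second place, but with `n=3` it never reaches the reranker. That is the N-too-small failure in miniature.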
Two common variations on the base pattern:
- Hybrid first-stage with RRF: run BM25 and a dense retriever in parallel, fuse their ranked lists via Reciprocal Rank Fusion, then rerank the fused candidate set. This is the pattern Anthropic's Contextual Retrieval recipe uses and the one Elasticsearch ships as a first-class retriever primitive [1].
- Multi-pass reranking: a cheap first reranker narrows N=100 down to 30 candidates, then an expensive second reranker scores the 30 and produces the final top 10. Useful when the most accurate reranker is too slow to run on 100 candidates.
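A sketch of the multi-pass variation, assuming two scoring callables of different cost. The numeric "documents" and both toy scorers are illustrative only; the call counter shows that the expensive scorer runs on the shortlist, not on the full candidate set.

```python
def rerank(query, docs, scorer, keep):
    """Score every document with `scorer` and keep the best `keep`."""
    ranked = sorted(docs, key=lambda doc: scorer(query, doc), reverse=True)
    return ranked[:keep]


def cascade(query, candidates, cheap_scorer, expensive_scorer, mid=30, k=10):
    """Narrow with the cheap scorer first, then spend the expensive one on the rest."""
    shortlist = rerank(query, candidates, cheap_scorer, mid)
    return rerank(query, shortlist, expensive_scorer, k)


# Toy setup: documents are integers, and each scorer prefers a nearby target value.
expensive_calls = []


def cheap(query, doc):
    return -abs(doc - 50)


def expensive(query, doc):
    expensive_calls.append(doc)  # record each call to show how rarely it runs
    return -abs(doc - 47)


final = cascade("example query", list(range(100)), cheap, expensive)
print(len(expensive_calls), final[0])  # 30 47
```

The trade-off is the same as in the base pattern, one level down: a document the cheap reranker drops is invisible to the expensive one.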
The Cross-Encoder in Detail
The first cross-encoder built specifically for relevance ranking was a BERT-based architecture trained on MS-MARCO in 2019 [2]. The recipe was simple: take a pretrained BERT, concatenate the query and document with a [SEP] token, fine-tune on labeled passage-ranking pairs. The output is a relevance score from the [CLS] token. The simplicity is the point. The architecture has barely changed in seven years; what has changed is the size of the underlying model, the size of the training data, and the breadth of domains the released models cover.
Training signal comes from supervised relevance labels. The two standard sources:
- MS-MARCO passage ranking: about 500,000 queries from real Bing search logs, each paired with a relevant passage and (typically) many negative passages. The labels are sparse and noisy but plentiful.
- TREC Deep Learning: annual benchmarks with denser human judgments on a smaller set of queries. The judgments are the gold standard for evaluation; MS-MARCO labels are the gold standard for training.
Loss functions are typically binary cross-entropy on positive-versus-negative pairs or a contrastive ranking loss that pushes positive pairs above negatives by a margin. The released open-source rerankers cite both kinds of training in their model cards.
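The two loss families can be written out in a few lines. This is a sketch on raw scores, assuming a sigmoid output for the pointwise case and a hinge-style margin loss as one instance of the ranking case; the margin of 1.0 is an arbitrary choice for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logit, label):
    """Pointwise loss: label is 1 for a relevant pair, 0 for a negative pair."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Pairwise loss: zero once the positive outscores the negative by `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


# An undecided logit of 0.0 on a positive pair costs ln(2).
print(binary_cross_entropy(0.0, 1))

# The positive already leads by 1.5, which exceeds the margin, so no loss.
print(margin_ranking_loss(2.0, 0.5))

# The positive leads by only 0.3, so the loss is the remaining gap to the margin.
print(margin_ranking_loss(0.5, 0.2))
```

The pointwise loss treats each pair in isolation, while the pairwise loss only cares about the ordering between a positive and a negative for the same query.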
One technical complication worth naming: cross-encoders inherit the context window of the underlying transformer. A 512-token model can see at most 512 tokens of query plus document combined, special tokens included. Documents that do not fit in what remains after the query must be chunked before reranking, and the reranker scores each chunk independently. This is the same chunking problem the dense retrievers face, and the same long-document remedies apply: late chunking, hierarchical reranking, document-level pooling of chunk scores.
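A sketch of the window arithmetic and of document-level pooling, under several assumptions: whitespace tokens stand in for subword tokens, three positions are reserved for [CLS] and the two [SEP] markers, chunks do not overlap, and max pooling is used as one of several possible pooling choices. The function names are hypothetical.

```python
def chunk_tokens(doc_tokens, query_len, window=512, special_tokens=3):
    """Split a document into chunks that fit the window alongside the query."""
    budget = window - query_len - special_tokens
    if budget <= 0:
        raise ValueError("query leaves no room for document tokens")
    return [doc_tokens[i:i + budget] for i in range(0, len(doc_tokens), budget)]


def score_long_document(query_tokens, doc_tokens, pair_scorer, window=512):
    """Score each chunk independently, then pool with max over the chunk scores."""
    chunks = chunk_tokens(doc_tokens, len(query_tokens), window)
    return max((pair_scorer(query_tokens, chunk) for chunk in chunks), default=0.0)


def toy_scorer(query_tokens, chunk):
    """Stand-in for a cross-encoder: fraction of query tokens found in the chunk."""
    return len(set(query_tokens) & set(chunk)) / len(set(query_tokens))


query_tokens = "how is spotting fluid used to free stuck pipe".split()  # 9 tokens
doc_tokens = ["filler"] * 1000 + ["spotting", "fluid"] + ["filler"] * 198  # 1200 tokens

chunks = chunk_tokens(doc_tokens, len(query_tokens))
print([len(c) for c in chunks])  # [500, 500, 200]

score = score_long_document(query_tokens, doc_tokens, toy_scorer)
print(score)  # only the last chunk matches, on 2 of the 9 query tokens
```

Max pooling rewards a document with one strong chunk; a document whose relevance is spread thinly across chunks will score lower, which is the limitation chunk-level scoring carries.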
Reranker Selection in 2026
The reranker landscape divides cleanly into open-source and hosted commercial options. The choice usually comes down to latency budget, multilingual requirements, and whether the deployment runs in a network with reliable outbound API access.
| Reranker | Type | Parameters | Notable for |
|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | Open source | 22M | Speed; the workhorse baseline still in active use |
| BAAI/bge-reranker-v2-m3 | Open source | 568M | Multilingual coverage; strong on BEIR zero-shot |
| BAAI/bge-reranker-v2-gemma | Open source | 2B | Larger context, built on the Gemma 2B backbone |
| Jina Reranker v2 | Open source | 278M | Long-context (8K tokens), production-grade |
| Cohere Rerank 3.5 | Hosted API | Undisclosed | 100+ languages, low-friction integration |
| Voyage Rerank 2 | Hosted API | Undisclosed | Domain-tuned variants for code, finance, legal |
Rough latency numbers on commodity GPU hardware, for N = 100 candidates per query:
- 22-million-parameter MiniLM: 5 to 15 milliseconds per query, fits comfortably in any latency budget that allows a reranker at all.
- 568-million-parameter bge-v2-m3: 30 to 80 milliseconds per query on a single GPU, faster if batched.
- 2-billion-parameter bge-v2-gemma: 150 to 400 milliseconds per query; usually needs batching or a dedicated GPU.
- Cohere Rerank (API): 50 to 200 milliseconds network plus inference, depending on candidate count and region.
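For sizing other values of N, a rough sketch is to scale the N = 100 ranges above linearly. Linear scaling is an assumption that ignores batching effects and load, so the output is a first estimate to check against measurements, not a prediction. The names `scale_latency` and `fits_budget` are hypothetical.

```python
# Per-query latency ranges in milliseconds at N = 100, taken from the list above.
LATENCY_AT_100_MS = {
    "MiniLM (22M)": (5, 15),
    "bge-v2-m3 (568M)": (30, 80),
    "bge-v2-gemma (2B)": (150, 400),
}


def scale_latency(range_at_100_ms, n):
    """Scale a (low, high) range measured at N = 100 to another candidate count."""
    low, high = range_at_100_ms
    factor = n / 100
    return (low * factor, high * factor)


def fits_budget(model, n, budget_ms):
    """True when even the slow end of the scaled range stays within the budget."""
    _, high = scale_latency(LATENCY_AT_100_MS[model], n)
    return high <= budget_ms


for model, at_100 in LATENCY_AT_100_MS.items():
    print(model, scale_latency(at_100, n=200))

print(fits_budget("MiniLM (22M)", n=200, budget_ms=50))      # True
print(fits_budget("bge-v2-gemma (2B)", n=50, budget_ms=100))  # False
```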
The MTEB Reranking leaderboard is the standard public benchmark for comparing rerankers across domains [3]. Position on the leaderboard correlates with general reranking quality but does not predict in-domain performance, particularly for legal, medical, or specialized technical corpora where fine-tuning on local relevance labels typically dominates leaderboard rankings.
Reciprocal Rank Fusion as the Other Tool
The article on fixing the query covered Reciprocal Rank Fusion in detail. It belongs in the reranking conversation too because the two techniques compose cleanly.
RRF combines multiple ranked lists into one without needing to know the raw scores. For each document, its RRF score is the sum across input lists of 1 over (60 plus its rank in that list). Documents that appear high in multiple lists win. The constant 60 is the SIGIR 2009 default that every major implementation inherits [4].
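The formula is short enough to write out directly. A minimal sketch, assuming each input is a list of document IDs in rank order with ranks starting at 1; the function name and toy lists are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank) over lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Documents missing from a list simply contribute nothing for that list.
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['b', 'a', 'd', 'c']
```

Document `b` wins with ranks 2 and 1, narrowly ahead of `a` with ranks 1 and 3, and both beat the documents that appear in only one list. Only ranks are used, so the BM25 and cosine score scales never need to be reconciled.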
The natural composition with reranking:
- Run BM25 against the corpus. Take top 100.
- Run dense vector retrieval against the corpus. Take top 100.
- Run learned-sparse retrieval (SPLADE, ELSER) if available. Take top 100.
- Fuse the three ranked lists with RRF. The fused candidate set has the union of all three.
- Rerank the fused candidate set with a cross-encoder. Return top K.
RRF handles the diversity of the candidate set; reranking handles the precision of the final ordering. The two techniques compose cleanly because they do different jobs.
The Anthropic Contextual Retrieval recipe published in 2024 is the canonical reference for this exact composition: BM25, contextualized dense vectors, RRF fusion, Cohere Rerank as the final pass. The reported numbers show the contextual BM25-plus-vector hybrid cutting the top-20 retrieval failure rate by 49 percent against the baseline, and the reranker extending that to 67 percent, roughly a third fewer failures than the hybrid alone [1].
When Reranking Earns Its Compute
Reranking is not free, and not every query benefits equally. The decision framework has three axes: latency budget, recall headroom, and query type.
Latency budget. The compute cost of reranking is roughly linear in N. On the single-GPU figures quoted earlier, a small reranker at N = 100 adds about 5 to 15 milliseconds per query. A medium reranker adds 30 to 80 milliseconds. A large reranker adds 150 to 400 milliseconds, before any slowdown from concurrent load.
Recall headroom. Reranking can only reorder what the first stage surfaced. If Recall@100 is already 95% on a representative eval set, the reranker has room to add precision. If Recall@100 is 60%, the right document is missing from the candidate set for 40% of queries and no reranking will recover it. Fix the first stage first; rerank later.
Query type. Different query distributions benefit unevenly:
- Long natural-language questions: reranking helps a lot. The cross-encoder can match the question's intent against statement-form passages in ways the bi-encoder cannot.
- Exact-identifier queries (SKUs, well IDs, error codes): reranking adds little. BM25 already wins on these by exact term match; a cross-encoder cannot improve a perfect match.
- Multi-faceted technical queries: reranking helps when documents must satisfy multiple constraints. The cross-encoder can recognize a document that addresses all the constraints, where a bi-encoder might rank a document high for matching one constraint strongly.
A practical first move when adding reranking to a system: stratify the eval set by query type, measure first-stage precision and reranker-aided precision separately for each stratum, and confirm the lift is concentrated in the query types where reranking is expected to help. If the lift is uniform across types, the eval is probably under-stratified and is hiding where the win actually comes from.
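A sketch of the stratified comparison, assuming each eval example already carries a query-type label, the first-stage ranking, the reranked ranking, and the judged-relevant set. The field names and the two toy examples are hypothetical.

```python
from collections import defaultdict


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are judged relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def lift_by_stratum(examples, k):
    """Mean precision@k before and after reranking, grouped by query type."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [first-stage sum, reranked sum, count]
    for ex in examples:
        bucket = sums[ex["type"]]
        bucket[0] += precision_at_k(ex["first_stage"], ex["relevant"], k)
        bucket[1] += precision_at_k(ex["reranked"], ex["relevant"], k)
        bucket[2] += 1
    return {
        qtype: {"first_stage": a / n, "reranked": b / n, "lift": (b - a) / n}
        for qtype, (a, b, n) in sums.items()
    }


examples = [
    {"type": "natural_language", "relevant": {"d1", "d2"},
     "first_stage": ["x", "d1", "y", "d2"], "reranked": ["d1", "d2", "x", "y"]},
    {"type": "identifier", "relevant": {"d5"},
     "first_stage": ["d5", "z"], "reranked": ["d5", "z"]},
]

report = lift_by_stratum(examples, k=2)
for qtype, row in report.items():
    print(qtype, row)
```

On this toy data the natural-language stratum gains 0.5 in precision@2 and the identifier stratum gains nothing, which is the shape of result the text describes as expected.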
Production Pitfalls
Latency under load is the first thing teams underestimate. A reranker that takes 80 milliseconds per query on a quiet GPU may take 250 milliseconds under concurrent load. Batching helps throughput at the cost of per-query latency. The right batch size depends on traffic patterns and the underlying serving framework.
Cache locality is the second. Query vectors can be cached, and document vectors are precomputed by definition, but reranker scores depend on the joint sequence and cannot be reused across different queries. A reranker cache hit rate is essentially zero unless the same query appears repeatedly with the same candidate set, which is rare.
Model freshness is the third. A reranker trained on MS-MARCO is a general-purpose model. It performs adequately on most domains but rarely matches a model fine-tuned on in-domain relevance labels. The fine-tuning path is straightforward: collect a few thousand judged pairs from production logs, fine-tune the open-source base model, deploy. Most teams put this off too long because the unfine-tuned baseline works adequately, and then discover their domain-specific failure modes only after a downstream incident.
Cold-start is the fourth. A new RAG system has no judged labels and no production logs. The reranker is therefore a generic MS-MARCO-trained model whose in-domain performance is unknown. LLM-judged relevance labels are the standard workaround: ask a current frontier model (Claude 4.7, GPT-5.x, Gemini 3.x) to grade a sample of query-document pairs against a written rubric, use those judgments as both training data and eval data, monitor for divergence as production data accumulates.
A/B testing reranking changes is the fifth pitfall and the most subtle. Reranking changes the visible top-K. If you measure success by user click-throughs, the existing reranker's choices bias the click distribution: users click what they see. Holdout evaluations have to compare reranked versus un-reranked retrievals on judged eval sets, not on click logs alone. The judged eval set must be representative of the production query distribution, which means refreshing it on a regular cadence.
A Worked Example
The query, in our running oil-and-gas worked-example domain:
What are the recommended remediation procedures for differential sticking in deviated wells with high overbalance?
The first-stage retrieval is a hybrid: BM25 plus a dense vector retriever, fused by RRF. Both return their top 100 against an indexed corpus of drilling-operations documents. The fused candidate set has approximately 130 unique documents (overlap between BM25 and dense is partial, not complete).
Sampling the top 10 of the fused candidate set, before reranking:
- A document titled "Differential Sticking in High-Angle Wells": relevant and well-matched.
- A document titled "Overbalanced Drilling Mud Programs": relevant on overbalance but does not discuss sticking remediation.
- A document titled "Wellbore Stability in the Wolfcamp": partially relevant; mentions overbalance but focuses on stability, not sticking.
- A document titled "Mechanical Sticking versus Differential Sticking": directly relevant but ranked seventh, not first.
- Several documents about deviated-well drilling generally, with no specific treatment of sticking.
After the cross-encoder reranker scores all 130 candidates, the top 10 changes substantially:
- The "Differential Sticking in High-Angle Wells" document moves up to rank 1.
- The "Mechanical Sticking versus Differential Sticking" document moves up to rank 2; the reranker recognizes that distinguishing the two types is highly relevant to a remediation question.
- A document titled "Spotting Fluid Procedures for Stuck Pipe Recovery" appears in the top 5; the document mentions overbalance only in passing, so the first stage ranked it low, but the cross-encoder recognizes that spotting fluid is the standard remediation for differential sticking.
- The general "Wellbore Stability" document drops out of the top 10. It matched on overbalance but not on sticking; the cross-encoder downweights documents that satisfy one of the three constraints but not the others.
The lift is concentrated in the queries that combine multiple constraints. The reranker is doing the work that the bi-encoder's compressed vector cannot: recognizing which combinations of constraints a document actually satisfies.
Honest Limitations
Reranking is the most consistently useful single addition to a basic RAG retrieval stack. It is not a fix for every failure mode.
Domain shift. A reranker trained on MS-MARCO encodes a particular distribution of query types and relevance criteria. Legal corpora, medical literature, code documentation, and specialized technical domains all show measurable performance drops compared to web-search benchmarks. The remedy is fine-tuning on in-domain labels, which is feasible but costs time the team often does not have at first.
Long documents. Cross-encoders inherit the context window of their underlying transformer. Most are 512 tokens; the larger ones reach 8K. Documents longer than the window must be chunked, and chunks are scored independently. A document whose relevance to the query is distributed across multiple chunks may not surface as well as a document whose relevance is concentrated in a single chunk.
Short queries with little lexical signal. The cross-encoder is most powerful when the query carries enough tokens to interact with the document. Two-word queries give the model little to attend to, and the reranker score collapses toward the prior. For under-specified queries, the right intervention is on the query side (multi-query, query2doc) rather than on the reranker side.
Latency at extreme scale. A reranker running on every query at thousands of queries per second requires substantial GPU capacity. The hosted commercial rerankers shift the operational burden but introduce per-call pricing that adds up quickly at scale. Open-source rerankers running on owned hardware are typically more cost-effective above a certain volume.
Compute is real. The 30 to 80 milliseconds of added latency from a medium reranker is invisible in chat-style RAG but unacceptable in autocomplete or low-latency search. The decision to add reranking is a decision to spend that compute on every query, and the system needs to be sized for it.
References
Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.
- Anthropic. (2024). "Introducing Contextual Retrieval." Anthropic engineering blog. The canonical production recipe combining BM25, contextualized dense vectors, RRF fusion, and Cohere Rerank as the final pass; reports a 67% retrieval-failure reduction over baseline with the full stack, and 49% before the reranking step is added.
- Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv. The first paper to apply a BERT cross-encoder to passage ranking on MS-MARCO; established the architecture every subsequent reranker inherits.
- Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). "MTEB: Massive Text Embedding Benchmark." Hugging Face. The standard public benchmark for comparing embedding models and rerankers across tasks and languages.
- Cormack, G. V., Clarke, C. L. A., & Buttcher, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009. Establishes the RRF formula and the k=60 default that every modern implementation inherits.
- Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR 2020. Introduces late interaction, a middle point between bi-encoder and cross-encoder that preserves per-token document representations.
- Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019. The Sentence-BERT baseline that made bi-encoder retrieval practical and the reference architecture most open-source bi-encoders still build on.
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS Datasets and Benchmarks. The standard zero-shot retrieval benchmark; nearly every reranker on the Hugging Face leaderboard reports BEIR numbers.
- BAAI. (2024). "bge-reranker-v2-m3 model card." Hugging Face. The reference open-source multilingual reranker as of 2026, with documented training procedures and BEIR results.