Re-ranking: The Second Chance

First-pass retrieval optimizes for recall, not precision. The top fifty documents from a BM25 or vector retriever are a candidate set, not a final ranking. The reranker is the second-stage scorer that reads each query-document pair together and produces a relevance signal the first pass could not afford to compute. The two-stage pattern is the floor of every production RAG stack in 2026, not an optional addition.

Why Retrieval Needs a Second Pass

First-pass retrievers are tuned for recall. They have to scan the whole corpus and return a candidate set within a tight latency budget. They cannot afford to deeply analyze every query-document pair, so they use scoring shortcuts: term overlap weighted by IDF (BM25), or cosine similarity between independently-encoded vectors (dense retrieval). These shortcuts are computationally cheap, which is why they work over millions of documents. They are also coarse, which is why a top-50 result list typically contains some documents that are not actually relevant.

The standard production response is to add a second stage. Take the top N candidates from the first pass, where N is typically 50 to 200. Score each candidate with a slower, more accurate model. Sort by the new scores. Return the top K, where K is typically 10. The reranker's job is to fix the ordering, not expand the set. If the right document is not in the top-N from the first pass, no amount of reranking will recover it. Recall is the first pass's responsibility; precision is the reranker's.

The numbers that matter for sizing N and K come from the evaluation discipline covered in the measuring-retrieval article. Recall@N tells you what fraction of relevant documents the first pass surfaces in its top N. If Recall@100 is 95% on your judged eval set, you can rerank with confidence that the relevant document is almost always present in the candidate set. If Recall@100 is 60%, fix the first pass before reaching for a reranker.
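
As a minimal sketch of that measurement, the snippet below computes Recall@N per query and averages it over an eval set. The function names, document ids, and the two toy queries are illustrative assumptions, not taken from any particular eval harness.

```python
def recall_at_n(ranked_ids, relevant_ids, n):
    """Fraction of the judged-relevant documents that appear in the top n."""
    if not relevant_ids:
        raise ValueError("query has no judged-relevant documents")
    top_n = set(ranked_ids[:n])
    return len(top_n & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_n(runs, n):
    """Average Recall@N over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_n(ranked, rel, n) for ranked, rel in runs) / len(runs)


# Toy eval set with hypothetical document ids.
runs = [
    (["d3", "d7", "d1", "d9"], {"d1", "d7"}),  # both relevant docs inside the top 3
    (["d2", "d5", "d8", "d4"], {"d4", "d6"}),  # d4 sits at rank 4, d6 was never retrieved
]

print(mean_recall_at_n(runs, 3))  # 0.5
```

The second query shows the case a reranker cannot repair: one relevant document is outside the cutoff and the other was never retrieved at all.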

. . .

Bi-Encoders versus Cross-Encoders

The architectural distinction is the one thing to understand cold. It is what makes two-stage retrieval possible at all.

A bi-encoder encodes the query and the document independently. Each becomes a vector in the same embedding space. Relevance is the cosine similarity (or dot product) between the two vectors. The document vectors can be precomputed at index time, so at query time the system needs to compute only the query vector and a similarity score against each candidate. Sub-linear search via HNSW or IVF makes the candidate-set step efficient even at corpus sizes of hundreds of millions of documents.
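
The sketch below illustrates the bi-encoder scoring step in plain Python. The three-dimensional vectors are made-up stand-ins for real encoder output, `first_pass` is a hypothetical helper, and the brute-force scan stands in for an ANN index such as HNSW or IVF.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Index time: document vectors are computed once and stored.
# Assumption: toy 3-d vectors, not output from a real model.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
    "doc_c": [0.7, 0.6, 0.1],
}


def first_pass(query_vector, doc_vectors, n):
    """Brute-force top-n by cosine; a production system would query an ANN index."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


# Query time: only the query needs encoding; documents are looked up, not re-encoded.
query_vector = [1.0, 0.2, 0.0]
print(first_pass(query_vector, doc_vectors, 2))  # ['doc_a', 'doc_c']
```

Nothing in `first_pass` looks at document text, which is why the document side can be precomputed.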

A cross-encoder feeds the query and the document into the model as a single sequence:

[CLS] query tokens [SEP] document tokens [SEP]

The full transformer attention sees both at once. Every query token can attend to every document token. The model outputs a single relevance score for the pair, typically from the projected [CLS] hidden state. Nothing can be precomputed: the score depends on the joint sequence, which is unique per query-document pair. At query time, the cross-encoder must run one forward pass per candidate.

The asymmetry in compute cost is what enforces the two-stage pattern:

For a corpus of ten million documents and N = 100 candidates, the bi-encoder is the only feasible first pass and the cross-encoder is the only sufficiently accurate second pass. They are complementary, not competing.
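
A back-of-the-envelope calculation makes the asymmetry concrete. The corpus size and N come from the text; the per-pair cost of one millisecond is an illustrative assumption, not a measured figure, and batching is ignored.

```python
def cross_encoder_seconds(num_pairs, ms_per_pair):
    """Total scoring time if every pair needs its own forward pass (no batching)."""
    return num_pairs * ms_per_pair / 1000.0


CORPUS_SIZE = 10_000_000  # from the text
N_CANDIDATES = 100        # from the text
MS_PER_PAIR = 1.0         # assumption: illustrative only, not a benchmark

full_corpus = cross_encoder_seconds(CORPUS_SIZE, MS_PER_PAIR)
candidates_only = cross_encoder_seconds(N_CANDIDATES, MS_PER_PAIR)

print(f"cross-encoder over the whole corpus: {full_corpus:,.0f} s per query")
print(f"cross-encoder over the top N only:   {candidates_only:.1f} s per query")
```

Whatever the real per-pair cost is, the ratio between the two cases is the corpus size divided by N, here a factor of 100,000.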

[Figure: Bi-encoder versus cross-encoder. Bi-encoder: query tokens and document tokens pass through separate encoders (document side precomputed) and the two vectors are compared by cosine similarity; cheap, scales to the corpus, used for first-pass retrieval. Cross-encoder: "[CLS] query tokens [SEP] doc tokens [SEP]" forms a single joint sequence, unique per query-document pair; a transformer with full attention produces the relevance score with one forward pass per candidate and no precomputation; used for the reranking second pass.]

The intuition for why cross-encoders are more accurate: a bi-encoder has to compress every document into a single fixed-length vector that captures all the things any future query might ask about. A cross-encoder receives the query alongside the document and can model the interaction directly. The cross-encoder is not smarter; it has access to information the bi-encoder cannot fit through its encoding bottleneck.

. . .

The Two-Stage Pattern

The canonical pipeline:

  1. Query arrives. Optional: query-side transformation (HyDE, decomposition, step-back) runs first, producing one or more retrieval queries.
  2. First-stage retrieval against the index. Usually a hybrid of BM25 and dense vectors, fused by RRF or a similar primitive. Returns the top N candidates (typically 50 to 200).
  3. Reranker scores each of the N candidates against the original query. Sorts by relevance score.
  4. Top K are returned (typically 10) for the LLM to read.

[Figure: Canonical two-stage retrieval pipeline. Stage 1: query arrives, optional transformation. Stage 2: first-stage retrieval (BM25 + dense vectors, RRF), the recall job, returns top N (50 to 200). Stage 3: cross-encoder reranker scores the N candidates, the precision job. Stage 4: top K (about 10 chunks) go to the LLM, bounded by the context window.]
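
The reranking step of the pipeline above can be sketched as a small function. This is a minimal sketch: `rerank` and `toy_score` are hypothetical names, and `toy_score` is a term-overlap placeholder standing in for a real cross-encoder forward pass.

```python
from typing import Callable, List, Tuple


def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], k: int = 10) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair, sort by score, keep the top k."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query terms found in the document.

    NOT a cross-encoder; it only stands in for one so the sketch runs.
    """
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


# Candidates in first-stage order (hypothetical passages).
candidates = [
    "stuck pipe caused by differential sticking",
    "mud weight selection for vertical wells",
    "remediation of differential sticking in deviated wells",
]
query = "differential sticking remediation deviated wells"

for doc, score in rerank(query, candidates, toy_score, k=2):
    print(f"{score:.1f}  {doc}")
```

Passing the scorer in as a callable keeps the sort-and-truncate logic independent of which reranker model is eventually plugged in.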

The constants matter. N too small and the reranker has nothing to reorder; N too large and reranker compute becomes the latency bottleneck. K is bounded by the LLM's effective context window for the retrieved chunks, which is smaller than the model's maximum context window because the LLM also needs the user query, system prompt, and reasoning space.

Two common variations on the base pattern:

. . .

The Cross-Encoder in Detail

The first cross-encoder built specifically for relevance ranking was a BERT-based architecture trained on MS-MARCO in 2019 [2]. The recipe was simple: take a pretrained BERT, concatenate the query and document with a [SEP] token, fine-tune on labeled passage-ranking pairs. The output is a relevance score from the [CLS] token. The simplicity is the point. The architecture has barely changed in seven years; what has changed is the size of the underlying model, the size of the training data, and the breadth of domains the released models cover.

Training signal comes from supervised relevance labels. The two standard sources:

Loss functions are typically binary cross-entropy on positive-versus-negative pairs or a contrastive ranking loss that pushes positive pairs above negatives by a margin. The released open-source rerankers cite both kinds of training in their model cards.
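
The two loss families can be written out for a single training example as follows. This is a sketch under stated assumptions: the hinge form is one common way to express a margin-based ranking loss, the default margin of 1.0 is arbitrary, and real training code would use a framework's numerically stable implementations rather than these plain-Python versions.

```python
import math


def bce_loss(logit, label):
    """Binary cross-entropy on one (query, document) pair; label is 1 or 0."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid turns the raw score into a probability
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge-style ranking loss: zero once the positive leads by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


print(round(bce_loss(0.0, 1), 4))      # 0.6931, an undecided model pays ln(2)
print(margin_ranking_loss(2.5, 0.5))   # 0.0, positive already leads by more than the margin
```

The binary cross-entropy form treats each pair in isolation, while the margin form only cares about the gap between a positive and a negative for the same query.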

One technical complication worth naming: cross-encoders inherit the context window of the underlying transformer. A 512-token model can see at most 512 tokens of query, document, and special tokens combined. Documents that do not fit in what remains after the query must be chunked before reranking, and the reranker scores each chunk independently. This is the same chunking problem dense retrievers face, and the same long-document remedies apply: late chunking, hierarchical reranking, document-level pooling of chunk scores.
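
A minimal sketch of the chunk-then-pool remedy follows. It assumes the document is already tokenized, reserves three positions for the [CLS] and two [SEP] markers, uses non-overlapping chunks, and pools with max; the helper names and the chunk scores are hypothetical.

```python
def chunk_tokens(doc_tokens, query_len, window=512, special_tokens=3):
    """Split document tokens into chunks that fit the window alongside the query.

    special_tokens=3 reserves room for [CLS] and the two [SEP] markers.
    """
    budget = window - query_len - special_tokens
    if budget <= 0:
        raise ValueError("query leaves no room for document tokens")
    return [doc_tokens[i:i + budget] for i in range(0, len(doc_tokens), budget)]


def document_score(chunk_scores, pool=max):
    """Collapse per-chunk reranker scores into one document-level score."""
    return pool(chunk_scores)


doc_tokens = ["tok"] * 1200          # stand-in for a tokenized 1,200-token document
chunks = chunk_tokens(doc_tokens, query_len=20)
print([len(c) for c in chunks])      # [489, 489, 222]

# Hypothetical reranker scores, one per chunk.
print(document_score([0.12, 0.81, 0.33]))  # 0.81
```

Max pooling ranks a document by its single best chunk; passing a different `pool` function, such as a mean, changes how much weight relevance spread across several chunks receives.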

. . .

Reranker Selection in 2026

The reranker landscape divides cleanly into open-source and hosted commercial options. The choice usually comes down to latency budget, multilingual requirements, and whether the deployment runs in a network with reliable outbound API access.

| Reranker | Type | Parameters | Notable for |
| --- | --- | --- | --- |
| ms-marco-MiniLM-L-6-v2 | Open source | 22M | Speed; the workhorse 2019 baseline still in active use |
| BAAI/bge-reranker-v2-m3 | Open source | 568M | Multilingual coverage; strong on BEIR zero-shot |
| BAAI/bge-reranker-v2-gemma | Open source | 2B | Larger context, distilled from Gemma |
| Jina Reranker v2 | Open source | 278M | Long-context (8K tokens), production-grade |
| Cohere Rerank 3.5 | Hosted API | Undisclosed | 100+ languages, low-friction integration |
| Voyage Rerank 2 | Hosted API | Undisclosed | Domain-tuned variants for code, finance, legal |

Rough latency numbers on commodity GPU hardware, for N = 100 candidates per query:

The MTEB Reranking leaderboard is the standard public benchmark for comparing rerankers across domains [3]. Position on the leaderboard correlates with general reranking quality but does not predict in-domain performance, particularly for legal, medical, or specialized technical corpora where fine-tuning on local relevance labels typically dominates leaderboard rankings.

. . .

Reciprocal Rank Fusion as the Other Tool

The article on fixing the query covered Reciprocal Rank Fusion in detail. It belongs in the reranking conversation too because the two techniques compose cleanly.

RRF combines multiple ranked lists into one without needing to know the raw scores. For each document, its RRF score is the sum across input lists of 1 over (60 plus its rank in that list). Documents that appear high in multiple lists win. The constant 60 is the SIGIR 2009 default that every major implementation inherits [4].
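
The formula translates almost directly into code. The sketch below assumes 1-based ranks and that a document missing from a list contributes nothing for that list; `rrf_fuse` and the document ids are illustrative names.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_top = ["d1", "d2", "d3"]    # hypothetical BM25 ranking
dense_top = ["d3", "d1", "d4"]   # hypothetical dense-vector ranking

print(rrf_fuse([bm25_top, dense_top]))  # ['d1', 'd3', 'd2', 'd4']
```

Here d1 (ranks 1 and 2) edges out d3 (ranks 3 and 1), and both beat the documents that appear in only one list, without either retriever's raw scores ever being compared.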

The natural composition with reranking:

  1. Run BM25 against the corpus. Take top 100.
  2. Run dense vector retrieval against the corpus. Take top 100.
  3. Run learned-sparse retrieval (SPLADE, ELSER) if available. Take top 100.
  4. Fuse the three ranked lists with RRF. The fused candidate set has the union of all three.
  5. Rerank the fused candidate set with a cross-encoder. Return top K.

RRF handles the diversity of the candidate set; reranking handles the precision of the final ordering. The two techniques compose cleanly because they do different jobs.

The Anthropic Contextual Retrieval recipe published in 2024 is the canonical reference for this exact composition: BM25, contextualized dense vectors, RRF fusion, Cohere Rerank as the final pass. Reported numbers showed the contextual BM25-plus-embeddings hybrid cutting the top-20 retrieval failure rate by 49 percent against the baseline, and adding the reranker on top raising that reduction to 67 percent [1].

. . .

When Reranking Earns Its Compute

Reranking is not free, and not every query benefits equally. The decision framework has three axes: latency budget, recall headroom, and query type.

Latency budget. The compute cost of reranking is roughly linear in N. A small reranker at N = 100 adds about 50 to 100 milliseconds per query. A medium reranker adds 200 to 500 milliseconds. A large reranker can add a second or more.

| Latency budget | Workload | Reranker fit |
| --- | --- | --- |
| Under 100 ms end-to-end | Autocomplete, low-latency search | Skip reranking; strong hybrid first stage only |
| 100 to 500 ms, interactive search | Standard search box, faceted query UI | Small reranker: MiniLM, N = 50 to 100 |
| 500 ms to 2 s, chat-style RAG | Conversational answers, document Q&A | Medium reranker: bge-v2-m3, N = 100 |
| 2 s and above | Deep research, agentic workflows | Large or multi-pass: bge-v2-gemma, LLM-as-reranker |

Reranker choice by latency budget.

Recall headroom. Reranking can only reorder what the first stage surfaced. If Recall@100 is already 95% on a representative eval set, the reranker has room to add precision. If Recall@100 is 60%, the right document is missing from the candidate set for 40% of queries and no reranking will recover it. Fix the first stage first; rerank later.

Query type. Different query distributions benefit unevenly:

A practical first move when adding reranking to a system: stratify the eval set by query type, measure first-stage precision and reranker-aided precision separately for each stratum, and confirm the lift is concentrated in the query types where reranking is expected to help. If the lift is uniform across types, the eval is probably under-stratified and is hiding where the win actually comes from.
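
A stratified comparison along these lines might look like the sketch below. The record layout, the query-type labels, and the toy rankings are all hypothetical; a real eval would draw them from the judged set.

```python
from collections import defaultdict


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top k that is judged relevant (divides by k even if fewer were returned)."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def lift_by_stratum(records, k):
    """Return {query_type: (first_stage_p, reranked_p, lift)} averaged per stratum."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for rec in records:
        acc = sums[rec["type"]]
        acc[0] += precision_at_k(rec["first_stage"], rec["relevant"], k)
        acc[1] += precision_at_k(rec["reranked"], rec["relevant"], k)
        acc[2] += 1
    return {t: (b / n, a / n, (a - b) / n) for t, (b, a, n) in sums.items()}


# Hypothetical judged queries, labeled with made-up query types.
records = [
    {"type": "multi-constraint", "relevant": {"c", "d"},
     "first_stage": ["a", "b", "c", "d"], "reranked": ["c", "d", "a", "b"]},
    {"type": "keyword", "relevant": {"x"},
     "first_stage": ["x", "y", "z"], "reranked": ["x", "y", "z"]},
]

for query_type, (before, after, lift) in lift_by_stratum(records, k=2).items():
    print(f"{query_type:18s} before={before:.2f} after={after:.2f} lift={lift:+.2f}")
```

In this toy data the lift lands entirely in one stratum, which is the pattern the paragraph above says to look for.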

. . .

Production Pitfalls

Latency under load is the first thing teams underestimate. A reranker that takes 80 milliseconds per query on a quiet GPU may take 250 milliseconds under concurrent load. Batching helps throughput at the cost of per-query latency. The right batch size depends on traffic patterns and the underlying serving framework.

Cache locality is the second. Query vectors can be cached, and document vectors are precomputed by definition, but reranker scores depend on the joint sequence and cannot be reused across different queries. The hit rate of a reranker score cache is essentially zero unless the same query appears repeatedly with the same candidate set, which is rare.

Model freshness is the third. A reranker trained on MS-MARCO is a general-purpose model. It performs adequately on most domains but rarely matches a model fine-tuned on in-domain relevance labels. The fine-tuning path is straightforward: collect a few thousand judged pairs from production logs, fine-tune the open-source base model, deploy. Most teams put this off too long because the unfine-tuned baseline works adequately, and then discover their domain-specific failure modes only after a downstream incident.

Cold-start is the fourth. A new RAG system has no judged labels and no production logs. The reranker is therefore a generic MS-MARCO-trained model whose in-domain performance is unknown. LLM-judged relevance labels are the standard workaround: ask a current frontier model (Claude 4.7, GPT-5.x, Gemini 3.x) to grade a sample of query-document pairs against a written rubric, use those judgments as both training data and eval data, monitor for divergence as production data accumulates.

A/B testing reranking changes is the fifth pitfall and the most subtle. Reranking changes the visible top-K. If you measure success by user click-throughs, the existing reranker's choices bias the click distribution: users click what they see. Holdout evaluations have to compare reranked versus un-reranked retrievals on judged eval sets, not on click logs alone. The judged eval set must be representative of the production query distribution, which means refreshing it on a regular cadence.

. . .

A Worked Example

The query, in our running oil-and-gas worked-example domain:

What are the recommended remediation procedures for differential sticking in deviated wells with high overbalance?

The first-stage retrieval is a hybrid: BM25 plus a dense vector retriever, fused by RRF. Both return their top 100 against an indexed corpus of drilling-operations documents. The fused candidate set has approximately 130 unique documents (overlap between BM25 and dense is partial, not complete).

Sampling the top 10 of the fused candidate set, before reranking:

After the cross-encoder reranker scores all 130 candidates, the top 10 changes substantially:

The lift is concentrated in the queries that combine multiple constraints. The reranker is doing the work that the bi-encoder's compressed vector cannot: recognizing which combinations of constraints a document actually satisfies.

. . .

Honest Limitations

Reranking is the most consistently useful single addition to a basic RAG retrieval stack. It is not a fix for every failure mode.

Domain shift. A reranker trained on MS-MARCO encodes a particular distribution of query types and relevance criteria. Legal corpora, medical literature, code documentation, and specialized technical domains all show measurable performance drops compared to web-search benchmarks. The remedy is fine-tuning on in-domain labels, which is feasible but costs time the team often does not have at first.

Long documents. Cross-encoders inherit the context window of their underlying transformer. Most are 512 tokens; the larger ones reach 8K. Documents longer than the window must be chunked, and chunks are scored independently. A document whose relevance to the query is distributed across multiple chunks may not surface as well as a document whose relevance is concentrated in a single chunk.

Short queries with little lexical signal. The cross-encoder is most powerful when the query carries enough tokens to interact with the document. Two-word queries give the model little to attend to, and the reranker score collapses toward the prior. For under-specified queries, the right intervention is on the query side (multi-query, query2doc) rather than on the reranker side.

Latency at extreme scale. A reranker running on every query at thousands of queries per second requires substantial GPU capacity. The hosted commercial rerankers shift the operational burden but introduce per-call pricing that adds up quickly at scale. Open-source rerankers running on owned hardware are typically more cost-effective above a certain volume.

Compute is real. The 200 to 500 milliseconds of added latency from a medium reranker is easy to absorb in chat-style RAG but unacceptable in autocomplete or low-latency search. The decision to add reranking is a decision to spend that compute on every query, and the system needs to be sized for it.

. . .

References

Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.

  1. Anthropic. (2024). "Introducing Contextual Retrieval." Anthropic engineering blog. The canonical production recipe combining BM25, contextualized dense vectors, RRF fusion, and Cohere Rerank as the final pass; reports a 49% retrieval-failure reduction over baseline for the contextual hybrid and 67% with the reranker added.
  2. Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv. The first paper to apply a BERT cross-encoder to passage ranking on MS-MARCO; established the architecture every subsequent reranker inherits.
  3. Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). "MTEB: Massive Text Embedding Benchmark." Hugging Face. The standard public benchmark for comparing embedding models and rerankers across tasks and languages.
  4. Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009. Establishes the RRF formula and the k=60 default that every modern implementation inherits.
  5. Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR 2020. Introduces late interaction, a middle point between bi-encoder and cross-encoder that preserves per-token document representations.
  6. Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019. The Sentence-BERT baseline that made bi-encoder retrieval practical and the reference architecture most open-source bi-encoders still build on.
  7. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS Datasets and Benchmarks. The standard zero-shot retrieval benchmark; nearly every reranker on the Hugging Face leaderboard reports BEIR numbers.
  8. BAAI. (2024). "bge-reranker-v2-m3 model card." Hugging Face. The reference open-source multilingual reranker as of 2026, with documented training procedures and BEIR results.