
Sources

Grounding, citations, and further reading for Re-ranking: The Second Chance.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to trace each numbered reference back to the primary source, see the original research framing, and pick up further reading on cross-encoder architecture, hybrid retrieval, and reranker evaluation.

The numbered references in the article hyperlink to the entries below; each entry carries a back-to-article link so you can resume reading where you left off.

About the Sources

Anthropic: Contextual Retrieval

Anthropic engineering blog, September 2024.

The canonical production recipe combining BM25, contextualized dense vectors, RRF, and Cohere Rerank as the final pass. Reports a 67 percent retrieval-failure reduction over baseline with the full stack (49 percent before the reranker is added), which isolates the marginal gain of the reranker on top of the hybrid first stage. Available at anthropic.com/news/contextual-retrieval.

Nogueira & Cho: Passage Re-ranking with BERT

Nogueira, R., & Cho, K. (2019). arXiv:1901.04085.

The first paper to apply a BERT cross-encoder to passage ranking on MS-MARCO. Established the [CLS] / [SEP] concatenation architecture every subsequent cross-encoder reranker inherits. Available at arxiv.org/abs/1901.04085.

Muennighoff et al.: MTEB

Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). arXiv:2210.07316.

The Massive Text Embedding Benchmark. Standard public benchmark for comparing embedding models and rerankers across tasks and languages. The reranking subset is what teams cite when comparing the open-source candidate set. Available at arxiv.org/abs/2210.07316.

Cormack, Clarke & Büttcher: Reciprocal Rank Fusion

Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). SIGIR 2009.

Establishes the RRF formula and the k=60 default that most modern implementations inherit. The paper shows empirically that RRF outperforms Condorcet methods and per-list rank-learning methods on standard IR benchmarks. Available at cormack.uwaterloo.ca.

Khattab & Zaharia: ColBERT

Khattab, O., & Zaharia, M. (2020). SIGIR 2020.

Introduces late interaction, a middle point between bi-encoder and cross-encoder that preserves per-token document representations and computes a MaxSim score at query time. Often cited as the architecture that taught the field to stop treating bi-encoder vs cross-encoder as a binary. Available at arxiv.org/abs/2004.12832.

Reimers & Gurevych: Sentence-BERT

Reimers, N., & Gurevych, I. (2019). EMNLP 2019.

The Sentence-BERT baseline that made bi-encoder retrieval practical at scale. Reference architecture most open-source bi-encoders still build on, and the source of the sentence-transformers library that powers most reranker training pipelines. Available at arxiv.org/abs/1908.10084.

Thakur et al.: BEIR

Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). NeurIPS Datasets and Benchmarks.

The standard zero-shot retrieval benchmark. Nearly every reranker model card reports BEIR numbers, and the paper's domain-shift findings are the empirical grounding for the "domain shift" limitation discussed in the article. Available at arxiv.org/abs/2104.08663.

BAAI: bge-reranker-v2-m3

BAAI (2024). Hugging Face model card.

The reference open-source multilingual reranker as of 2026, with documented training procedures and BEIR results. The 568M-parameter model that most teams without a domain-specific reranker default to. Available at huggingface.co/BAAI/bge-reranker-v2-m3.

The Two-Stage Pattern

1. Anthropic Contextual Retrieval as the production recipe

The Anthropic Contextual Retrieval writeup is the canonical production reference for the two-stage pattern from 2024 to 2026. The recipe combines BM25 and dense vector retrieval, both run against contextualized chunks (each chunk prefixed with a short LLM-generated context that situates it within the surrounding document), Reciprocal Rank Fusion to combine the two ranked lists, and a Cohere Rerank cross-encoder as the final pass.

The reported gains stack, each measured against the same baseline: contextualizing the chunks reduces retrieval failure by about 35 percent, hybrid BM25-plus-vector retrieval reduces it by 49 percent, and adding the reranker on top reduces it by 67 percent in total. The marginal contribution of the reranker (about a 35 percent reduction in the failures that remain after the hybrid stage alone) is the empirical anchor for the "reranking is the most consistently useful single addition" framing in the article.
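The reranker's marginal figure can be derived from the two cumulative reductions quoted above. The following is a minimal sketch of that arithmetic; the function name `marginal_reduction` is hypothetical, and the only inputs are the 49 and 67 percent figures from the writeup.

```python
def marginal_reduction(before: float, after: float) -> float:
    """Relative failure reduction added by a later stage.

    Both arguments are cumulative reductions measured against the same
    baseline, expressed as fractions (0.49 means 49 percent).
    """
    remaining_before = 1.0 - before  # share of baseline failures still left
    remaining_after = 1.0 - after
    return 1.0 - remaining_after / remaining_before


hybrid = 0.49         # hybrid first stage, no reranker
with_reranker = 0.67  # full stack including the reranker

reranker_gain = marginal_reduction(hybrid, with_reranker)
print(f"{reranker_gain:.1%}")  # prints 35.3%
```

The hybrid stage leaves 51 percent of the baseline failures and the full stack leaves 33 percent, so the reranker removes 18 of the remaining 51 points.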

Anthropic (2024), Introducing Contextual Retrieval. anthropic.com

↩ Back to article

The Cross-Encoder in Detail

2. Nogueira & Cho on the original cross-encoder reranker

The Nogueira and Cho paper is the architectural reference point for every cross-encoder reranker in production today. The recipe is unchanged from the 2019 paper: take a pretrained BERT, concatenate the query and document with a [SEP] token, fine-tune on labeled passage-ranking pairs, project the [CLS] hidden state to a single relevance score.
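The shape of that recipe at inference time can be sketched as a rerank loop around a pair scorer. This is an illustrative sketch, not the paper's code: `format_pair`, `rerank`, and `toy_overlap_score` are hypothetical names, and the toy scorer is a lexical stand-in so the example runs without a model. In practice the scorer slot would be filled by a fine-tuned cross-encoder, for example through the `CrossEncoder.predict` method in sentence-transformers.

```python
from typing import Callable, List, Tuple


def format_pair(query: str, document: str) -> str:
    """Show the concatenated sequence a BERT cross-encoder reads.

    Real tokenizers insert [CLS] and [SEP] themselves; the literal
    string here is only for illustration.
    """
    return f"[CLS] {query} [SEP] {document} [SEP]"


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best top_k."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, document: str) -> float:
    """Placeholder scorer (NOT a cross-encoder): share of query terms in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0


candidates = [
    "the weather is mild today",
    "bm25 is a lexical ranking function",
    "rerankers score query and passage jointly",
]
top = rerank("how do rerankers score a passage", candidates, toy_overlap_score, top_k=2)
print(format_pair("how do rerankers score a passage", top[0][0]))
```

The key property the sketch preserves is that the scorer sees the query and the document together, once per candidate, which is where both the quality and the per-query cost of a cross-encoder come from.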

The paper reports a 27 percent relative improvement in MRR@10 on MS-MARCO over the previous state of the art, which was the first time the field had a strong empirical argument for using transformer-based reranking despite the per-query cost. Most subsequent rerankers (MiniLM, the BGE series, Jina, Cohere, Voyage) follow this design and differ mainly in scale, training data, and language coverage.
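For readers unfamiliar with the metric, MRR@10 averages the reciprocal rank of the first relevant passage within the top 10 results, counting a query as zero when none appears. The following is a minimal sketch with made-up document IDs; `mrr_at_k` and `relative_improvement` are hypothetical helper names, and the toy data is not from the paper.

```python
from typing import Sequence, Set


def mrr_at_k(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant document in the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, relevant_ids in zip(rankings, relevant):
        for position, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant_ids:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# Toy data: three queries, one relevant document each.
rankings = [["d3", "d1", "d2"], ["d5", "d4"], ["d9"]]
relevant = [{"d1"}, {"d5"}, {"d0"}]
score = mrr_at_k(rankings, relevant)  # (1/2 + 1/1 + 0) / 3
```

A "relative improvement" is then the gain divided by the earlier score, so moving from 0.40 to 0.50 is a 25 percent relative improvement, not 10 percent.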

Nogueira & Cho (2019), Passage Re-ranking with BERT. arXiv:1901.04085

↩ Back to article

Reranker Selection in 2026

3. MTEB as the cross-domain leaderboard

The Massive Text Embedding Benchmark is the standard public benchmark for comparing embedding models and rerankers across tasks and languages. The reranking subset evaluates models on a set of retrieval datasets where relevance judgments are available, and the leaderboard position correlates with general reranking quality.

The caveat the article calls out is documented directly in the paper: zero-shot transfer is uneven across domains, and in-domain fine-tuning typically dominates leaderboard rank when the target domain diverges from the benchmark distribution. Treat MTEB position as a starting point, not as a prediction of in-domain performance.

Muennighoff et al. (2023), MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316

↩ Back to article

Reciprocal Rank Fusion as the Other Tool

4. Cormack, Clarke & Büttcher on RRF

The Cormack, Clarke, and Büttcher SIGIR 2009 paper establishes Reciprocal Rank Fusion as a strong default for combining ranked lists, and it is the source of the k=60 constant that every major implementation inherits. The formula sums 1 / (k + rank) across input lists for each document, with rank starting at 1.
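The formula translates almost directly into code. The following is a minimal sketch, assuming each input list holds document IDs ordered best first; the function name `reciprocal_rank_fusion` and the example IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse ranked lists by summing 1 / (k + rank) per document, rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # A document missing from a list simply contributes nothing for that list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print([doc_id for doc_id, _ in fused])  # prints ['b', 'c', 'a', 'd']
```

Documents that appear in both lists ("b" and "c") outrank the top BM25 hit "a", which appears in only one. Only ranks enter the computation, so the raw BM25 and cosine scores never need to be put on a common scale.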

The paper's empirical contribution is that RRF outperforms Condorcet methods and per-list rank-learning methods on standard IR benchmarks, despite using no document-level scores and no learned weights. The simplicity is the durability: a method with no training requirement, no per-corpus tuning, and no score-calibration step survived sixteen years of methodological progress because it is hard to beat on the average case.

Cormack, Clarke & Büttcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009

↩ Back to article