Measuring Retrieval
You cannot improve what you cannot measure. For retrieval there are two distinct measurement disciplines, and which one applies depends on which retriever is sitting underneath the question. A BM25 system can be evaluated end-to-end inside your own measurement loop because the algorithm is yours and the data is yours. An embedding-model-based system needs a shared external benchmark, and MTEB is the one the field has converged on, because no team owns the embedding model itself; comparability has to be manufactured rather than inherited. This article walks both: how to build a BM25 evaluation loop on your own corpus, and how to read MTEB without being misled by it.
The discipline of measurement matters more than any specific tool. If you cannot quantify the performance of a retrieval system on a representative query set, you cannot tell whether the next version is better, whether a parameter sweep helped, or whether an upgrade you just paid for moved the number you care about. The choice without measurement is religion. The choice with measurement is engineering. The argument the article makes is not that BM25 evaluation is right or that MTEB is right; the argument is that some quantitative loop must close, and the loop that closes for BM25 is shaped differently from the loop that closes for embedding models.
Retrieval quality is not a feeling. It is a number you measure on a labeled set of queries, against a defined metric, with a procedure that another engineer could reproduce. Without that number, the question of whether your retriever is "good" has no answer and the question of whether the next one is "better" has no answer either.
The article walks the two loops the field uses: the closed loop you can run on BM25 inside your own engineering, and the open-coordinate-system loop that MTEB provides for embedding models. Both are quantitative; both are imperfect; both are necessary in their respective domains.
The Two Disciplines
BM25 and dense embedding models both turn a query into a ranked list of documents, but as objects to measure they could not be more different. The difference matters because it determines whether the evaluation loop closes inside your own engineering or whether you have to lean on a shared external benchmark to compare against vendors.
| Property | BM25 | Embedding models |
|---|---|---|
| Mechanism | Closed-form formula. Hand-computable for a small corpus. | Opaque transformer forward pass. Billions of weights. |
| Score decomposition | Trace any score to TF, IDF, and length-norm components. Explain why doc X beat doc Y. | Cosine of two high-dimensional vectors. No human-readable derivation. |
| Parameter tuning | k1 and b live on your laptop. Sweep them against your data with no vendor in the loop. | The "parameters" are billions of weights that cannot be tuned without retraining. |
| Reproducibility across teams | Two teams running BM25 on the same corpus get identical numbers. | "text-embedding-3" is a service that can change silently. Reproducibility requires sharing weights. |
| Vendor coupling | Zero. BM25 is in the public domain. | OpenAI, Cohere, BGE, NV-Embed, Voyage, Jina, Snowflake; each a different artifact. |
The implication of this table is that the measurement loop for BM25 is closed inside your own engineering. You implement BM25 (or use the well-known Lucene implementation that sits underneath Elasticsearch, OpenSearch, and Solr), you build a relevance-judged evaluation set on your own data, you sweep k1 and b against your queries, you measure precision and recall. Everything you need lives in your repository. The numbers you produce are reproducible by anyone with the same data, because the algorithm is fixed and shared.
The measurement loop for embedding models is not closed. You can build a relevance-judged evaluation set on your own data, you can run candidate models against it, and you can measure precision and recall the same way you do for BM25. What you cannot do is tune the embedding model itself; you can only swap one model for another. You also cannot guarantee that another team running "the same model" on the same data will produce the same numbers, because the model is either a hosted service that can change underneath you or a 7B-parameter checkpoint that has to be downloaded, hosted, and run with matching preprocessing.
MTEB exists to manufacture comparability in that second world, where comparability is not free. The benchmark gives every embedding model a shared coordinate system. The leaderboard gives the field a shared vocabulary for what "good" means before any single team has run the model on its own data. None of that infrastructure is needed for BM25 because BM25 has comparability built into the algorithm; if BM25 lived inside seventeen competing vendor APIs the way embedding models do, BM25 would need an MTEB too.
The argument the article makes is not that one loop is more rigorous than the other. They are different loops for different problems. The BM25 loop is sufficient when the retriever is BM25 and the data is yours. The MTEB loop is necessary when the retriever is an embedding model and the comparison space is everything the field has shipped this year. The rest of the article walks both, starting with the more self-contained one.
BM25 evaluation cycles through your own corpus and parameters, so the loop closes inside your engineering. MTEB evaluation fans third-party models into a shared test you do not own and a leaderboard you read against the field, so comparability is manufactured by the shared test rather than owned by any one team.
A BM25 Evaluation Loop You Run Yourself
The BM25 evaluation loop has the same five steps every information-retrieval evaluation has had since the Cranfield experiments of the 1960s. The fact that the loop is sixty years old is not a knock against it; the structure of the problem (rank documents by relevance to a query, then compare the ranking against a judged ground truth) has not changed and does not need to. What has changed is the tooling, and modern BM25 evaluation is something a small team can stand up in an afternoon.
Steps 4 and 5 are where measurement turns into improvement: sweep the parameters, stratify the metrics, repeat.
1. Build a relevance-judged evaluation set
The starting point is a set of queries paired with the documents in your corpus that are relevant to each query. For a small evaluation set, the labels can be binary (relevant or not). For a more nuanced set, the labels can be graded (highly relevant, partially relevant, not relevant), which is what nDCG was designed to use. The judgments typically come from one of three sources:
- Click-through data from production logs. If your system is already deployed, the documents users clicked after a given query are a reasonable proxy for relevance. The signal is noisy (users click for reasons other than relevance) but plentiful and cheap.
- Subject-matter expert annotation. A domain expert reads a sample of (query, document) pairs and labels them. The signal is high-quality but the labor cost limits how many pairs can be judged. A few hundred queries against the top fifty BM25 results per query is enough to drive evaluation for a corpus in the hundreds of thousands.
- LLM-judged annotation. A current-generation LLM, given the query, the document text, and a labeling rubric, can produce judgments that correlate well with expert annotation on most domains. This is the lowest-cost option and the one most teams now reach for. The trade-off is that the labels reflect the LLM's biases as well as the domain's ground truth, and the labels should be spot-checked.
The evaluation set does not need to be large. A few hundred queries with judged top-50 results per query is enough to surface real differences between BM25 configurations. The goal is not to certify the system in the abstract; it is to compare two configurations of the same system and pick the one that is measurably better on the queries you care about.
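One common way to store such judgments is the TREC-style qrels text format: one line per judged pair, holding the query ID, an unused iteration field, the document ID, and the relevance grade. The loader below is a minimal sketch; the function name `parse_qrels`, the IDs, and the 0/1/2 grading scale are illustrative assumptions, not something the article prescribes.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse TREC-style qrels lines into {query_id: {doc_id: grade}}."""
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        query_id, _iteration, doc_id, grade = line.split()
        qrels[query_id][doc_id] = int(grade)
    return dict(qrels)

# Hypothetical judgments: 2 = highly relevant, 1 = partially, 0 = not relevant.
SAMPLE = """
q1 0 doc_17 2
q1 0 doc_04 0
q1 0 doc_88 1
q2 0 doc_17 0
q2 0 doc_31 2
""".splitlines()

qrels = parse_qrels(SAMPLE)
print(qrels["q1"])  # {'doc_17': 2, 'doc_04': 0, 'doc_88': 1}
```

Keeping explicit grade-0 lines records that a pair was judged and found not relevant, which is different from a pair nobody looked at.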
2. Choose the metrics
Four metrics cover almost every BM25 evaluation in production use:
- Precision@k: of the top k documents BM25 returned, what fraction were judged relevant? This is the metric a downstream system actually feels, because the downstream system only sees the top k. If k = 10 and precision@10 = 0.6, six of the ten documents the user (or the LLM context window) gets are relevant.
- Recall@k: of all the documents judged relevant for this query, what fraction appear in the top k? This catches the failure mode where BM25 returned a highly precise top 3 but missed the document the user actually needed.
- Mean Reciprocal Rank (MRR): the average of 1/rank for the first relevant document returned. A relevant doc at rank 1 contributes 1.0; at rank 2 contributes 0.5; at rank 10 contributes 0.1. MRR rewards getting at least one relevant doc near the top.
- nDCG@k (Normalized Discounted Cumulative Gain): a graded-relevance metric that discounts gain by rank position. nDCG@10 is the dominant single-number summary in BEIR and MTEB Retrieval, and reporting it lets your numbers be compared against published benchmarks even when your evaluation set is private.
For a Week 5 RAG context the most diagnostic pair is precision@10 (does the LLM see relevant context?) and recall@50 (did BM25 surface the right doc at all, even if reranking would be needed to move it up?). Reporting both numbers stratified by query type (named-entity queries, conceptual queries, exact-identifier queries) catches uneven failure modes that the aggregate would hide.
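The four metrics from this step can be sketched in a few lines of plain Python for a single query. The function names and the toy data are hypothetical. The nDCG version assumes linear gain with a log2 rank discount; some tools use an exponential gain instead, so numbers are only comparable when the gain function matches. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all judged-relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none was returned."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, grades, k):
    """nDCG@k with linear gain and a log2 rank discount (an assumed variant)."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    actual = dcg([grades.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical single query: graded judgments and a ranked result list.
grades = {"a": 2, "b": 1, "c": 1}
relevant = {d for d, g in grades.items() if g > 0}
ranked = ["x", "a", "b", "y", "z"]

p5 = precision_at_k(ranked, relevant, 5)   # 2 of 5 returned docs are relevant
r5 = recall_at_k(ranked, relevant, 5)      # 2 of 3 relevant docs were returned
rr = reciprocal_rank(ranked, relevant)     # first relevant doc is at rank 2
ndcg5 = ndcg_at_k(ranked, grades, 5)
print(p5, round(r5, 3), rr, round(ndcg5, 3))
```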
3. Run BM25 against your evaluation set
Three BM25 implementations dominate the open-source landscape:
- Lucene, via Elasticsearch or OpenSearch or as a raw Lucene index. This is the implementation most production systems are already running. Defaults: k1 = 1.2, b = 0.75. Indexing and querying APIs are well documented and the analyzer pipeline (tokenizer, stemmer, stop-word filter) is configurable per field.
- Pyserini, a Python wrapper around Lucene built specifically for IR research. The standard tool for academic BM25 baselines and the one most BEIR papers report against.
- rank_bm25, a pure-Python implementation that requires no JVM and no index build. Slower than Lucene at scale but trivially fast for evaluation sets in the hundreds of thousands of documents, and trivial to integrate into a Jupyter notebook.
For an evaluation loop, the choice does not matter much; all three return essentially the same top-k for the same parameters on the same data, modulo tokenization and small differences in the IDF variant each library uses. The choice does matter when the BM25 evaluation has to match a production index byte-for-byte, in which case using the same implementation as production is the only safe option.
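To make the scoring step concrete without depending on any of the three libraries, here is a minimal pure-Python BM25 sketch. It is not the Lucene, Pyserini, or rank_bm25 code: the helper names (`build_index`, `bm25_terms`, `search`) are hypothetical, whitespace splitting stands in for a real analyzer, and the IDF uses the variant with 1 added inside the logarithm. It returns per-term contributions, which is what makes a BM25 score traceable.

```python
import math
from collections import Counter

def build_index(tokenized_docs):
    """Collect the corpus statistics BM25 needs."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    avg_len = sum(len(t) for t in tokenized_docs) / len(tokenized_docs)
    return {"docs": tokenized_docs, "df": doc_freq, "avgdl": avg_len}

def bm25_terms(index, query_tokens, doc_id, k1=1.2, b=0.75):
    """Per-term BM25 contributions for one document: {term: idf * tf_part}."""
    tokens = index["docs"][doc_id]
    n_docs = len(index["docs"])
    tf = Counter(tokens)
    length_norm = 1 - b + b * len(tokens) / index["avgdl"]
    parts = {}
    for term in set(query_tokens):
        if tf[term] == 0:
            continue  # term absent from this document contributes nothing
        df = index["df"][term]
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf_part = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        parts[term] = idf * tf_part
    return parts

def search(index, query_tokens, k=10, k1=1.2, b=0.75):
    """Rank all documents by total BM25 score; return [(doc_id, score)]."""
    scored = []
    for doc_id in range(len(index["docs"])):
        score = sum(bm25_terms(index, query_tokens, doc_id, k1, b).values())
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: -pair[1])[:k]

# Toy corpus (hypothetical); whitespace tokenization stands in for an analyzer.
corpus = [
    "stuck pipe caused by wellbore instability",
    "pipe inspection schedule",
    "wellbore instability in shale formations",
]
index = build_index([doc.split() for doc in corpus])
hits = search(index, "stuck pipe".split())
explanation = bm25_terms(index, "stuck pipe".split(), hits[0][0])
print(hits)
print(explanation)
```

The `explanation` dictionary shows how much each query term contributed to the top document's score, which is the kind of trace a cosine similarity does not offer.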
4. Sweep k1 and b
The k1 and b parameters are the two knobs BM25 exposes for tuning to a corpus. k1 controls how quickly the term-frequency reward saturates (higher k1 means repeated terms keep adding signal for longer; lower k1 means the reward saturates earlier). b controls how aggressively long documents are penalized (b = 0 ignores length entirely; b = 1 fully normalizes for length). The defaults k1 = 1.2 and b = 0.75 work well on most corpora, but a corpus of short product descriptions or a corpus of long legal contracts can both benefit from non-default values.
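For reference, one common formulation of the BM25 score shows where each knob enters. Treat the exact IDF term as an assumption, since implementations differ in the variant they use. Here $f(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the document length, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

With $b = 0$ the length term drops out and the denominator is $f(t, d) + k_1$. As $f(t, d)$ grows, the fraction approaches $k_1 + 1$, which is why a larger $k_1$ lets repeated terms keep adding signal for longer.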
A k1 / b sweep is the simplest experiment in IR. Pick a grid (k1 in {0.8, 1.0, 1.2, 1.5, 2.0}, b in {0.5, 0.6, 0.75, 0.9}), run BM25 against the evaluation set for each cell, and report nDCG@10. The result is a 5 by 4 matrix of nDCG values, and the highest cell wins. If the highest cell is the default (1.2, 0.75), you have just verified that the defaults work for your corpus, which is information worth having. If the highest cell is somewhere else, you have just earned a measurable improvement at the cost of one afternoon of compute.
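The sweep itself is a small loop. The sketch below uses the grid from the paragraph above and a stand-in scoring function: `toy_evaluate` is purely hypothetical and only exists so the example runs. In a real loop it would re-query the BM25 index with the given parameters and return mean nDCG@10 over the evaluation set.

```python
from itertools import product

def sweep(evaluate, k1_grid, b_grid):
    """Evaluate every (k1, b) cell; return the score table and the best cell."""
    table = {(k1, b): evaluate(k1, b) for k1, b in product(k1_grid, b_grid)}
    best = max(table, key=table.get)
    return table, best

K1_GRID = [0.8, 1.0, 1.2, 1.5, 2.0]
B_GRID = [0.5, 0.6, 0.75, 0.9]

def toy_evaluate(k1, b):
    """Stand-in for 'mean nDCG@10 over the evaluation set' (hypothetical).

    A real version would re-query the BM25 index with these parameters
    and average nDCG@10 over all judged queries.
    """
    return 0.60 - 0.05 * abs(k1 - 1.5) - 0.10 * abs(b - 0.6)

table, best = sweep(toy_evaluate, K1_GRID, B_GRID)

# Print the 5 x 4 matrix: rows are k1 values, columns are b values.
print("k1\\b  " + "  ".join(f"{b:>5}" for b in B_GRID))
for k1 in K1_GRID:
    print(f"{k1:<5} " + "  ".join(f"{table[(k1, b)]:.3f}" for b in B_GRID))
print("best cell:", best)
```

When the best cell beats the default by less than the run-to-run noise of the evaluation set, treating it as a tie and keeping the defaults is a reasonable call.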
5. Stratify the metrics
An aggregate nDCG@10 across a thousand queries can hide a model that performs well on conceptual queries and badly on exact-identifier queries, or vice versa. The standard fix is stratification: report the aggregate metric per query category. The categories that show up most often in production BM25 evaluation are:
- Exact-identifier queries: error codes, SKUs, ticket IDs, named methods. BM25 should dominate here.
- Short conceptual queries: two- or three-word topical queries. BM25 is competitive but vector retrievers often catch paraphrase BM25 misses.
- Long natural-language queries: sentences and questions. BM25's bag-of-words signal degrades; rephrased equivalent queries can produce different rankings.
- Multi-term technical queries: like the running drilling-engineer example from the classic-search article ("stuck pipe with wellbore instability"). BM25 wins on the technical vocabulary, but the conceptual relation between terms is invisible to it.
Stratified reporting is what separates a useful evaluation from a vanity metric. The aggregate says BM25 is "good." The stratified breakdown says BM25 is excellent on identifier queries (precision@10 = 0.95), good on multi-term technical queries (precision@10 = 0.72), and mediocre on long natural-language queries (precision@10 = 0.48). That stratified picture is what tells you when to reach for a different retriever and when to stick with BM25.
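Stratification is a group-by over per-query scores. A minimal sketch follows, assuming each query already carries a category label; the category names, query IDs, and precision values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def stratify(per_query_scores, categories):
    """Average a per-query metric within each query category."""
    buckets = defaultdict(list)
    for query_id, score in per_query_scores.items():
        buckets[categories[query_id]].append(score)
    report = {cat: mean(scores) for cat, scores in buckets.items()}
    report["ALL"] = mean(list(per_query_scores.values()))
    return report

# Hypothetical precision@10 values and category labels for six queries.
p_at_10 = {"q1": 1.0, "q2": 0.9, "q3": 0.7, "q4": 0.8, "q5": 0.5, "q6": 0.4}
category = {
    "q1": "identifier", "q2": "identifier",
    "q3": "technical", "q4": "technical",
    "q5": "natural_language", "q6": "natural_language",
}

report = stratify(p_at_10, category)
for name, value in report.items():
    print(f"{name:<17} {value:.2f}")
```

In this toy data the aggregate lands near 0.72 while the per-category means range from 0.45 to 0.95, which is the spread the aggregate hides.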
When this loop is enough; when it isn't
The BM25 evaluation loop is enough when:
- The retriever is BM25 (or BM25-with-custom-analyzer) and you control the index.
- The query distribution is dominated by exact-identifier, technical-vocabulary, or short-phrase queries.
- The evaluation set covers a representative slice of the production query distribution.
- The corpus changes slowly enough that today's evaluation predicts tomorrow's behavior.
The BM25 loop is not enough when the retriever is no longer just BM25. Once dense vectors or sparse-neural encoders enter the stack, the comparison space expands beyond what the BM25 algorithm produces, and the BM25 loop has no way to evaluate the alternatives. At that point the conversation moves to the second loop, which is what the rest of the article walks. The transition is not "throw BM25 evaluation away." The transition is "now you need both."
The MTEB Loop, for Embedding Models
From here forward the article walks the second discipline: how MTEB measures embedding models, what it covers, what it does not, and how to read the leaderboard without being misled. The transition is the natural one: when the BM25 loop above is no longer sufficient because the retriever is something other than BM25, MTEB is what the field uses to compare alternatives.
The Benchmark in Brief
MTEB was introduced by Muennighoff, Tazi, Magne, and Reimers at EACL 2023.1 The original paper described a benchmark spanning eight embedding tasks across 58 datasets and 112 languages, evaluated on 33 contemporary embedding models. The headline empirical finding, reported as a single sentence in the abstract, has aged into a permanent feature of the field: "no particular text embedding method dominates across all tasks." Different models win different categories, and there is no single embedding that you should always pick.
The benchmark is distributed as a Python package mteb on PyPI, with the source on GitHub at embeddings-benchmark/mteb.2 The repository describes the toolbox as a "multimodal toolbox for evaluating embeddings and retrieval systems," and instructs users to cite both the original MTEB paper and the later MMTEB extension when reporting results.
The project has grown considerably since 2022. The official documentation site states that the current MTEB toolbox spans "more than a 1000 different tasks" across "more than 1000 languages," covering both image and text modalities.3 Most of that expansion lives in MMTEB, the Massive Multilingual Text Embedding Benchmark, which the next section gets to. The original eight-task English-centric MTEB sits inside that larger umbrella as the most-watched subset of the leaderboard.
Each task in MTEB is a Python class that subclasses an abstract task type and declares its own metadata. A retrieval task, for example, declares its dataset path, the modality (text-to-text), the evaluation split, the language, and the main scoring metric (usually nDCG at 10 for retrieval).4 The architectural decision to make every task a versioned class is what lets the benchmark grow while remaining reproducible. If a dataset changes upstream, the MTEB class can pin its version. If a new task is added, it slots in alongside the others without changing the runner.
How an MTEB Pass Actually Runs
An MTEB evaluation, run in full, is a substantial computation. The benchmark's official README documents two invocation patterns. The Python form is the one most users start with: import mteb, load a SentenceTransformer-compatible model, select tasks, and call mteb.evaluate on the pair.5
# Minimal example from the MTEB README.
import mteb
from sentence_transformers import SentenceTransformer

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification.v2"])
results = mteb.evaluate(model, tasks=tasks)
That is the smallest end-to-end MTEB evaluation: load a model, pick a task, run. The same evaluation can be invoked from the shell, which is useful for batch evaluations on a remote GPU machine:
mteb run \
-m sentence-transformers/all-MiniLM-L6-v2 \
-t Banking77Classification.v2 \
--output-folder results
A custom embedding model, one not already wrapped by the mteb package, integrates through a small Python class that exposes an encode method matching the benchmark's EncoderProtocol signature.6 The same mteb.evaluate call accepts that class and runs it through every task in the selected suite. This is the integration path every model-release team uses to wrap a new architecture (E5, GTE, BGE, NV-Embed, and so on) into the same evaluation harness.
The contract is worth naming explicitly because it determines what MTEB can and cannot evaluate. MTEB is encoder-agnostic: any model that exposes an encode method returning vectors fits, whether those vectors are dense (sentence-transformers, E5, BGE, NV-Embed) or sparse-neural (SPLADE, ELSER). ELSER v2, in fact, sits in the top ten of the MTEB Retrieval leaderboard when the multiple flavors of each competitor family are grouped together, which is direct evidence that sparse-neural retrieval is a first-class citizen of the benchmark, not a footnote.39
MTEB is not, however, retrieval-architecture-agnostic. A pure BM25 retriever has no encode step, so it cannot be wrapped through the EncoderProtocol; BM25 baselines on MTEB retrieval tasks are run by separate tooling and reported alongside, not produced inside the MTEB pass. A hybrid retriever (BM25 plus dense, fused with reciprocal rank fusion) has no fusion primitive in the runner either; end-to-end hybrid evaluation lives in BEIR-style harnesses that are designed for that kind of pipeline. The practical implication is that MTEB scores describe the quality of an embedding component, not the quality of a complete retrieval system. Two retrieval systems with very different end-to-end behavior can share an embedding component and the same MTEB score.
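For readers who have not met reciprocal rank fusion, the fusion step that sits outside the MTEB runner is short. The sketch below is a generic RRF implementation, not code from MTEB or BEIR: each document's fused score is the sum of 1 / (k + rank) over the input rankings, with k = 60 as the conventional default. The document IDs and both input rankings are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-4 lists from a BM25 retriever and a dense retriever.
bm25_ranking = ["d1", "d2", "d3", "d4"]
dense_ranking = ["d3", "d1", "d5", "d6"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd5', 'd4', 'd6']
```

Documents that both retrievers return rise to the top, which is behavior an embedding-only score cannot describe.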
Running a full English MTEB pass is not free. The MMTEB paper documents that "the English MTEB (henceforth referred to as MTEB(eng, v1)) benchmark requires up to two days of processing on a single A100 GPU."7 The MMTEB extension introduces a zero-shot English subset specifically to reduce that cost while preserving the ranking order, because reproducible evaluation at the scale of the full benchmark is genuinely expensive. Two days of A100 time is the practical cost of one comparable score on the leaderboard.
This matters for teaching. A homework assignment that asks a student to "evaluate an embedding model on MTEB" without scoping which subset to run is a homework assignment that finishes in two days on rented hardware. The standard pedagogical move is to pick a small, representative slice: a single retrieval task, or a single language, or the zero-shot English subset from MMTEB.
What MTEB Measures: The Datasets
The eight task types in the original MTEB are abstractions. The actual evaluation is run on concrete datasets, most of which existed before MTEB and were folded into the benchmark as a curated suite. The retrieval sub-suite, in particular, is drawn almost entirely from BEIR, the heterogeneous zero-shot information retrieval benchmark that Thakur et al. introduced at NeurIPS 2021.8 BEIR was already the standard cross-domain retrieval evaluation by the time MTEB launched, and pulling it in wholesale was a deliberate choice: MTEB inherits BEIR's coverage of news, biomedical, financial, scientific, and conversational retrieval, and BEIR inherits MTEB's visibility.
The retrieval datasets MTEB exposes are worth enumerating because each one has a specific shape. A model that is great on one and weak on another is making a measurable claim about its training distribution.
| Dataset | Domain / Style | Source |
|---|---|---|
| MS MARCO | ~1M real Bing queries paired with passages from web documents | Nguyen et al. (2016)9 |
| Natural Questions | ~300K real Google queries with Wikipedia long/short answers, including nulls | Kwiatkowski et al. (2019)10 |
| HotpotQA | 113K Wikipedia multi-hop questions with sentence-level supporting facts | Yang et al. (2018)11 |
| FEVER | 185K Wikipedia-derived claims labeled Supported / Refuted / NotEnoughInfo | Thorne et al. (2018)12 |
| TREC-COVID | Biomedical IR test collection over COVID-19 scientific literature | Voorhees et al. (2020)13 |
| SciFact | 1.4K expert-written scientific claims paired with research abstracts | Wadden et al. (2020)14 |
| FiQA | Opinion-based financial QA from StackExchange, Reddit, StockTwits | Maia et al. (2018)15 |
| ArguAna | Counterargument retrieval from idebate.org argument pairs | Wachsmuth et al. (2018)16 |
| CLIMATE-FEVER | Real-world climate change claims, FEVER methodology applied to internet text | Diggelmann et al. (2020)17 |
| SciDocs | Seven scientific-document tasks, citation-graph-pretrained | Cohan et al. (2020)18 |
| DBpedia-Entity v2 | Entity search with crowdsourced relevance judgments | Hasibi et al. (2017)19 |
A pattern emerges from the list. The retrieval tasks are dominated by what one could call search-engine-style queries: short user-typed questions matched against Wikipedia pages, web passages, or domain corpora. The judgment protocol is closer to "does this passage contain the answer" than to "does this passage support deep multi-step reasoning." That is not a flaw of MTEB; it is a description of what MTEB has data for. The retrieval sub-suite measures generalization across domains, not across cognitive difficulty. The benchmark is honest about this in its task selection, and the limit shows up cleanly in later benchmarks that try to measure something else.
The non-retrieval tasks (classification, clustering, STS, reranking, pair classification, summarization, bitext mining) draw from a different mix of datasets: Banking77 and Amazon reviews for classification, arXiv and Reddit for clustering, STS-B for semantic similarity, AskUbuntu duplicate questions and SciDocs for reranking, and so on. The original MTEB paper enumerates the full set; the official documentation site keeps the live list. The point worth holding onto is that MTEB's "score" is an average across these heterogeneous tasks, and an embedding model that excels at retrieval while underperforming at classification can still post a strong aggregate score. Drilling into the per-task breakdown is part of using the benchmark responsibly.
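A small arithmetic sketch shows why the breakdown matters. The scores below are invented for illustration and are not real leaderboard results; the point is only that the same per-dataset numbers give different aggregates depending on whether the average runs over datasets or over task types.

```python
from statistics import mean

# Hypothetical per-dataset scores, grouped by task type (not real results).
scores = {
    "retrieval": [0.58, 0.61, 0.55, 0.60],
    "classification": [0.74, 0.70],
    "summarization": [0.30],
}

# Average over datasets: task types with more datasets carry more weight.
all_datasets = [s for group in scores.values() for s in group]
mean_by_dataset = mean(all_datasets)

# Average over task types: each type counts once, whatever its dataset count.
per_type = {task: mean(group) for task, group in scores.items()}
mean_by_type = mean(list(per_type.values()))

print(f"mean over datasets:   {mean_by_dataset:.3f}")
print(f"mean over task types: {mean_by_type:.3f}")
for task, value in per_type.items():
    print(f"  {task:<15} {value:.3f}")
```

In this toy case the single weak summarization dataset barely moves the dataset-level mean, while the per-type view makes it obvious.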
Where MTEB Has Grown: Extensions
The original 2022 MTEB was English-centric, English-only for retrieval and clustering, and bounded by the 58 datasets the authors could integrate at launch. Every subsequent year has added extensions, both from the original team and from independent groups. Tracking which extension covers which evaluation gap is part of reading the modern leaderboard.
MMTEB: the multilingual successor
The Massive Multilingual Text Embedding Benchmark, presented at ICLR 2025, is a community-driven extension co-led by Enevoldsen et al. with roughly 85 contributors including Niklas Muennighoff (the original MTEB lead).7 It expands the task-category count from 8 to 10, the dataset count from 58 to "over 400," and the language coverage to 250+ languages excluding bitext mining (and more than 1,000 including it). It also introduces new task categories that MTEB v1 did not have: instruction following, long-document retrieval, and code retrieval.
MMTEB is the answer to one of MTEB v1's loudest acknowledged limits, namely English-only retrieval and clustering. It is also the benchmark that most modern multilingual embedding models report against today. The zero-shot English subset within MMTEB is the practical evaluation slice for English-only model comparisons that want to be cheaper than two days of A100 time.
Language-specific variants
Beyond MMTEB, several language-specific MTEB-derived benchmarks have emerged:
- C-MTEB, the Chinese variant, was released as part of the C-Pack package alongside the BGE family of Chinese and English embedding models.20 Six tasks, 35 datasets.
- MTEB-French covers eight task categories with 15 existing French datasets and three new ones, evaluating 51 embedding models in the launch paper.21
- PL-MTEB, the Polish variant, covers 30 tasks across five categories and contributes 12 new Polish tasks back to the broader MTEB suite.22
- German Text Embedding Clustering Benchmark covers German-language clustering and explores dimensionality reduction and continued pre-training for German BERT models.23
The pattern across these is consistent: each language community adapts the MTEB framework to its own corpus and task availability, and the resulting benchmarks fold back into the umbrella leaderboard. A model that wants to claim multilingual generality has to report against more than one of them.
Domain-specific and reasoning-intensive successors
The most consequential extensions are the ones that probe properties MTEB v1 was not designed to measure. CoIR covers code information retrieval: ten code datasets across eight retrieval tasks, designed to slot into the MTEB and BEIR frameworks.24 LongEmbed tests long-context retrieval up to 32,000 tokens, finding that most embedding models remain limited to context windows of 8,000 tokens or less and that position-interpolation techniques can extend that without retraining.25 AIR-Bench uses LLM-driven automated data generation to build dynamic IR test sets across domains and languages, avoiding reliance on fixed corpora.26
The two extensions that have produced the most striking empirical evidence are BRIGHT and RAR-b, both designed to test reasoning-intensive retrieval rather than surface-form retrieval.27, 28 Their evidence is the subject of the next section.
What MTEB Does Not Measure
A responsible reading of any benchmark begins with the question of what the benchmark cannot tell you. MTEB's own authors put a Limitations appendix at the back of the original paper, which is a reasonable starting point and worth quoting directly.1
What the authors acknowledged in 2022
The MTEB anchor paper's Appendix B names four limits in plain language:
- Long documents are missing. "MTEB covers multiple text lengths (S2S, P2P, S2P), but very long documents are still missing. The longest datasets in MTEB have a few hundred words, and longer text sizes could be relevant for use cases like retrieval." This is the gap that LongEmbed and the long-document retrieval task in MMTEB later addressed.
- Task balance is uneven. "Tasks in MTEB have a different amount of datasets with summarization consisting of only a single dataset." A model that excels at retrieval (many datasets, high weight in the aggregate) and underperforms at summarization (one dataset, low weight) can still score well on the average.
- Retrieval and clustering are English-only in v1. "MTEB contains multilingual classification, STS and bitext mining datasets. However, retrieval and clustering are English-only." This is the gap MMTEB was specifically designed to close.
- Data contamination is hard to control. "Our scale of experiments and that of model pre-training make controlling for data contamination challenging." This is the limit that produced the most consequential follow-up research.
What the maintainers documented in 2025
The "Maintaining MTEB" paper, written by the MTEB maintainers themselves, is the most candid document about the benchmark's reproducibility hazards.29 It reports concrete numbers. The leading e5-mistral-7b-instruct model achieves "a 95% zero-shot score on MTEB (English, v2), indicating that it was trained on only ∼5% of the benchmark's training splits." That sentence is worth reading twice. Even the leaderboard leader's score is not strictly speaking a zero-shot evaluation, because roughly 5% of the benchmark's training splits were part of its training data.
The maintainers also catalogue evaluation-protocol differences that complicate cross-model comparison: "embedding models can use prefixes (E5 using query: and passage: as query and passage prompts for retrieval), prompts can differ per task-type (Nomic models), prompts can be added to both the query and the passage during retrieval." Each of those is a degree of freedom that two model-release teams can use differently. Two MTEB scores compared without knowing the prompt protocol are not strictly comparable.
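The prefix convention is simple string handling, which is exactly why it is easy to apply inconsistently. The sketch below illustrates the `query:` / `passage:` pattern quoted above; the helper name `with_prefix` and the example texts are hypothetical, and a given model's documentation remains the authority on which prefixes it expects.

```python
# Role prefixes in the style quoted above for E5 ("query: " / "passage: ").
PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_prefix(texts, role, prefixes=PREFIXES):
    """Prepend the role-specific prefix a model expects before encoding."""
    return [prefixes[role] + text for text in texts]

queries = with_prefix(["how to free a stuck pipe"], role="query")
passages = with_prefix(
    ["Stuck pipe is often caused by wellbore instability."], role="passage"
)

print(queries[0])   # query: how to free a stuck pipe
print(passages[0])  # passage: Stuck pipe is often caused by wellbore instability.
```

Recording the prefix table next to every reported score is a cheap way to keep two evaluations comparable.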
What independent critiques have added
The most theoretically interesting critique is Weller, Boratko, Naim, and Lee's On the Theoretical Limitations of Embedding-Based Retrieval, presented at ICLR 2026.30 The paper proves that "the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding." In other words, a single-vector embedding of dimension d has a combinatorial ceiling on which document subsets it can express as top-k for any query. The authors construct a dataset (LIMIT) that exceeds that ceiling for current production-scale embeddings, and demonstrate that "even state-of-the-art models fail on this dataset despite the simple nature of the task." This is a foundational result about single-vector embedding architectures; it is not a critique of MTEB itself, but it is the kind of result MTEB cannot surface because it does not test for it.
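A toy case gives some intuition for the flavor of this result. It is not the paper's construction or proof. With one-dimensional embeddings and dot-product scoring, a query is a single number, so any positive query ranks documents by descending value and any negative query ranks them by ascending value. However many queries are tried, only two top-2 sets are ever returned out of the ten that exist for five documents. The embedding values are arbitrary.

```python
from math import comb

def reachable_top_k(doc_embeddings, queries, k):
    """Collect every distinct top-k set that some query in `queries` returns."""
    seen = set()
    for q in queries:
        ranked = sorted(
            range(len(doc_embeddings)),
            key=lambda i: -(q * doc_embeddings[i]),  # 1-D dot product
        )
        seen.add(frozenset(ranked[:k]))
    return seen

# Five documents embedded in a single dimension (hypothetical values).
docs = [-2.0, -0.5, 0.3, 1.1, 2.4]
# A dense sweep of nonzero 1-D query vectors.
queries = [x / 10 for x in range(-50, 51) if x != 0]

reachable = reachable_top_k(docs, queries, k=2)
print(len(reachable), "reachable of", comb(5, 2), "possible top-2 sets")
# 2 reachable of 10 possible top-2 sets
```

Higher dimensions raise the ceiling, but the paper's claim is that a ceiling tied to the dimension remains.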
The most empirically striking critique is BRIGHT.27 The paper reports that "the leading model on the MTEB leaderboard (Muennighoff et al., 2023) SFR-Embedding-Mistral (Meng et al., 2024), which achieves a score of 59.0 nDCG@10, produces a score of nDCG@10 of 18.3 on BRIGHT." A 41-point nDCG drop is not a small number. It is direct evidence that MTEB-style scores do not predict reasoning-intensive retrieval performance, where queries require chains of inference rather than surface matching. The point is not that MTEB is broken. The point is that MTEB measures one kind of retrieval, and BRIGHT measures another.
SFR-Embedding-Mistral on MTEB versus BRIGHT (nDCG@10, higher is better): same embedding model, two benchmarks. MTEB measures one kind of retrieval, BRIGHT measures another, and a leaderboard rank on the former says nothing about the latter.
The third critique, PTEB (Frank and Afli, EACL 2026), argues that "repeated tuning on a fixed suite can inflate reported scores and obscure real-world robustness" and that paraphrase-based stochastic evaluation reveals encoders are "sensitive to changes in token space even when semantics remain fixed."31 This is the leaderboard-gaming critique made formal: a static benchmark that everyone targets will, over time, reward models that overfit to that benchmark's specific phrasing.
How Model Releases Actually Use It
Every recent embedding model release frames its contribution against MTEB. The pattern is consistent across vendors and worth seeing as a list, because what changes between papers is the protocol, not the framing.
E5 (Microsoft, December 2022)
E5 introduced the query: / passage: prefix protocol that became a de facto standard among models that distinguish encoding roles.32 The release paper reports evaluation on "56 datasets from the BEIR and MTEB benchmarks" in both zero-shot and fine-tuned settings, and notes that "E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data." The MTEB protocol used here became part of the model's identity.
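A minimal sketch of that prefix protocol follows. The helper name `format_for_e5` is hypothetical, not part of any library; the only thing taken from the release is the pair of role prefixes.

```python
def format_for_e5(texts: list[str], role: str) -> list[str]:
    """Prepend the E5 role prefix ("query: " or "passage: ") to each text."""
    if role not in ("query", "passage"):
        raise ValueError(f"unknown role: {role!r}")
    return [f"{role}: {text}" for text in texts]

queries = format_for_e5(["how does bm25 weight rare terms"], role="query")
passages = format_for_e5(
    ["BM25 scales term weight by inverse document frequency."], role="passage"
)
print(queries[0])
print(passages[0])
```

The prefixed strings, not the raw ones, are what would be passed to the encoder. Two evaluations of the same model that differ only in whether this step ran are measuring different things.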
GTE (Alibaba DAMO Academy, August 2023)
GTE is presented as a "general-purpose text embedding model trained with multi-stage contrastive learning."33 The headline claim from the paper is comparison to OpenAI: "even with a relatively modest parameter count of 110M, GTE_base outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark." The framing is "MTEB places us above OpenAI." That is the comparison MTEB was designed to make legible.
BGE / FlagEmbedding (BAAI, 2023)
BGE was released alongside C-MTEB.20 The English BGE models "achieve state-of-the-art performance on MTEB benchmark," and the Chinese variants are evaluated on the freshly introduced C-MTEB. The model release and the benchmark were co-developed, which is a structurally awkward but common pattern: the team that wins the leaderboard is sometimes the team that built the leaderboard's language variant.
Nomic Embed (Nomic AI, February 2024)
Nomic Embed positions itself as "the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark."34 The novelty is the reproducibility ("we release the full curated training data and code that allows for full replication"), and MTEB is how the model release proves it is competitive with closed alternatives.
NV-Embed (NVIDIA, May 2024)
NV-Embed reports that its two variants "obtained the No.1 position on the MTEB leaderboard (as of May 24 and August 30, 2024, respectively) across 56 tasks."35 The training recipe ("two-stage contrastive instruction-tuning") is the contribution; the leaderboard ranking is the validation. This is the cleanest version of the leaderboard-as-target pattern in recent literature.
Snowflake Arctic-Embed (May 2024) and Jasper/Stella (December 2024)
Arctic-Embed is "a family of five Apache-2 licensed embedding models" trained with a recipe targeted "specifically at the MTEB Retrieval leaderboard."36 Jasper/Stella reports a "No.3 position on the MTEB leaderboard (as of December 24, 2024), achieving an average 71.54 score across 56 datasets" and uses MTEB performance as the optimization target driving its multi-teacher distillation recipe.37
Jina Embeddings 2 (October 2023)
Jina Embeddings 2 is "an open-source text embedding model capable of accommodating up to 8192 tokens" benchmarked on MTEB against OpenAI ada-002.38 The release argues for long-context evaluation as part of a complete embedding-model assessment.
Read together, these releases tell a single story. The competition for embedding-model quality is fought on the MTEB leaderboard. Whether that is a healthy state of affairs is a separate question, but it is the state of affairs, and the practical implication is that anyone selecting an embedding model in 2026 is selecting from a set of models that were all explicitly optimized to score well on MTEB.
Reading the Leaderboard Without Being Misled
The closing question for any benchmark guide is the operational one: how does a practitioner actually use MTEB without being misled by it? Three rules, drawn from the previous sections rather than imposed on them.
1. Treat the average score as a triage signal, not a verdict
The aggregate MTEB score across 56 (or 1,000+) tasks is useful for excluding obvious losers from a shortlist. It is much less useful for picking between two models near the top, because the average can hide a model that wins at retrieval and loses at classification. The right pattern is to filter by the task subset that resembles your workload. If the application is dense retrieval, look at the retrieval sub-average. If the application is classification with frozen features, look at the classification sub-average. The leaderboard exposes per-task breakdowns precisely because the average alone is not the right reading.
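The effect is easy to see with numbers. The sketch below uses invented scores for two hypothetical models (none of these are leaderboard values) and compares the overall average with the retrieval-only sub-average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores: (model, task_type, task_name, score).
# The numbers are invented for illustration and are not leaderboard values.
scores = [
    ("model_a", "Retrieval", "task_r1", 58.0),
    ("model_a", "Retrieval", "task_r2", 62.0),
    ("model_a", "Classification", "task_c1", 70.0),
    ("model_a", "Classification", "task_c2", 66.0),
    ("model_b", "Retrieval", "task_r1", 50.0),
    ("model_b", "Retrieval", "task_r2", 54.0),
    ("model_b", "Classification", "task_c1", 80.0),
    ("model_b", "Classification", "task_c2", 76.0),
]

def averages(rows, task_type=None):
    """Mean score per model, optionally restricted to one task type."""
    by_model = defaultdict(list)
    for model, kind, _task, score in rows:
        if task_type is None or kind == task_type:
            by_model[model].append(score)
    return {model: mean(vals) for model, vals in by_model.items()}

overall = averages(scores)
retrieval = averages(scores, task_type="Retrieval")
print("overall:", overall)
print("retrieval only:", retrieval)
```

In this made-up table model_b leads on the overall average (65.0 against 64.0) while model_a leads on retrieval (60.0 against 52.0), which is the situation the per-task breakdown exists to expose.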
2. Verify the protocol when comparing across releases
Two MTEB scores reported by two model-release teams are not strictly comparable unless the evaluation protocol matches. The maintainers' paper enumerates the degrees of freedom (prefixes, per-task prompts, prompt placement) that two teams might handle differently. If a model release does not disclose its protocol, the score is anecdotal. If it does disclose the protocol, the score is a measurement that can be reproduced, which is the entire reason the benchmark exists.
3. Read MTEB-adjacent benchmarks for the gaps
If the application is reasoning-intensive retrieval (legal, medical, multi-hop question answering), the MTEB-leaderboard ranking is at best a starting point. BRIGHT exists specifically because MTEB rank does not predict that workload. If the application is long-document retrieval, LongEmbed is the relevant probe. If the application is code retrieval, CoIR is the relevant probe. The right reading of MTEB is as the first benchmark in a stack, not as the last. The leaderboard is a coordinate system for short-passage text retrieval and adjacent tasks. For workloads that fall outside that, other coordinate systems exist, and the modern leaderboard makes most of them visible.
When Measurement Reveals a Problem
Measurement is half of the discipline. The other half is what you do when the measurement comes back worse than you wanted. This section walks the three things a retrieval team reaches for most often when the BM25 loop or the MTEB-shaped evaluation says quality is below target: reranking, query transformation, and a checklist of common failure modes that explain a disproportionate share of bad numbers. Reranking and query transformation are independent of which retriever sits underneath them, so they apply to BM25, dense, sparse-neural, and hybrid retrieval stacks alike; the failure modes note where a particular retriever is exposed or immune.
Reranking: the second pass
The retrievers described above (BM25, dense bi-encoders, sparse-neural encoders) all share a property that makes them fast: they encode the query and each document independently. The query becomes a vector once; the documents have pre-computed representations; similarity is a dot product or an inverted-index lookup. That independence is what lets a single retriever search millions of documents in milliseconds. It is also what limits what the retriever can recognize.
Cross-encoder rerankers break the independence. They take a single (query, document) pair as joint input and process it through a transformer's attention mechanism end-to-end, which lets the model attend over query tokens and document tokens simultaneously. The joint processing captures relevance signals that independent encoding cannot, like whether a document actually answers the question rather than merely discussing the same topic. The cost is that cross-encoders cannot pre-compute document representations, so they cannot be run over the full corpus at query time; running a cross-encoder against a million documents would take minutes to hours. Running it against fifty candidates takes a fraction of a second.41
This is the two-stage pipeline that has become a standard pattern in production retrieval. The first stage (the fast retriever, whether BM25, hybrid, or pure-vector) casts a wide net and returns 50 to 100 candidates with recall as the priority. The second stage (the cross-encoder reranker) refines the ordering and returns the top 5 to 10 with precision as the priority. The context window the LLM eventually sees is the second-stage output; the first stage's job is to make sure the answer is somewhere in the candidate set.
Stage 1 maximizes recall over the full corpus; stage 2 spends compute only on the candidates that survived. That ordering is what makes the pipeline tractable, because reranking 50 candidates costs roughly the same as searching one million documents in stage 1, whereas reranking the corpus directly would take minutes to hours.
In practice, teams report 10 to 25% improvements in retrieval relevance metrics after adding a reranker to their pipeline, depending on the workload and the strength of the first-stage retriever. Off-the-shelf cross-encoders that drop in cleanly include Cohere Rerank, BGE-reranker, and the cross-encoder/ms-marco-MiniLM family on Hugging Face. The added latency is real (50 to 200 ms for fifty candidates on a small cross-encoder) but typically acceptable for RAG response times measured in seconds. The pattern has become a production default because it reliably improves precision at a small, bounded cost.
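The two-stage shape can be sketched without any model at all. In the example below both scorers are toy stand-ins chosen for illustration: `term_overlap` plays the cheap first stage and `phrase_aware` plays the reranker. A real pipeline would substitute BM25 or a vector search for the former and a cross-encoder for the latter; the function names and corpus are assumptions.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(query: str, corpus: list[str], first_stage: Scorer,
                         reranker: Scorer, n_candidates: int = 50,
                         n_final: int = 5) -> list[str]:
    """Score every document cheaply, then rescore only the surviving candidates."""
    candidates = sorted(corpus, key=lambda d: first_stage(query, d),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: reranker(query, d),
                  reverse=True)[:n_final]

# Toy stand-ins (assumptions): term overlap for stage 1, and a "reranker"
# that additionally rewards documents containing the exact query phrase.
def term_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_aware(query: str, doc: str) -> float:
    return term_overlap(query, doc) + (5.0 if query.lower() in doc.lower() else 0.0)

corpus = [
    "your password reset email may land in spam",
    "to reset your password open settings",
    "password strength guidance",
    "quarterly revenue report",
]
top = retrieve_then_rerank("reset your password", corpus, term_overlap,
                           phrase_aware, n_candidates=3, n_final=1)
print(top)
```

The first two documents tie on term overlap, so stage 1 alone cannot tell the topical one from the one that answers the question; the second pass promotes the answer. The expensive scorer ran on three documents, not four, and that gap is the whole point at corpus scale.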
Query transformation: rewriting the question
Sometimes the retriever is fine and the query is the problem. Users phrase questions in ways that do not match how the corpus is written; technical queries use vocabulary that misses the document's terminology; very-specific questions retrieve narrow slices and miss the broader context an answer would need. Query transformation is the family of techniques that rewrite or augment the query before retrieval, with the goal of making the search input look more like the document distribution it is being searched against.
HyDE (Hypothetical Document Embeddings) is the most counterintuitive of the family.42 Instead of searching with the user's question, ask an LLM to write a hypothetical answer to the question first, then use that hypothetical answer as the search input. The hypothetical answer, even when factually imperfect, is stylistically much closer to documents in the corpus (which are written in the style of answers, not questions). For dense retrieval, this can be a substantial improvement when the embedding model was trained more on document-to-document similarity than on question-to-document similarity. The cost is one extra LLM call per query, plus the risk that the hypothetical answer steers retrieval in unintended directions for ambiguous queries.
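A minimal sketch of the control flow, under loud assumptions: `embed` is a toy bag-of-words stand-in for an embedding model, `fake_llm` returns a canned string in place of a real LLM call, and the corpus is invented. It illustrates the idea described above rather than the paper's exact procedure.

```python
import math
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption; a real system calls a model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def hyde_search(question: str, corpus: list[str],
                generate: Callable[[str], str], k: int = 1) -> list[str]:
    """Embed an LLM-written hypothetical answer instead of the question itself."""
    query_vec = embed(generate(question))
    return sorted(corpus, key=lambda d: cosine(query_vec, embed(d)),
                  reverse=True)[:k]

# Hypothetical stand-in for an LLM call; it returns a canned answer.
def fake_llm(question: str) -> str:
    return "insufficient rest impairs memory consolidation and recall"

corpus = [
    "cognitive effects of insufficient rest on memory consolidation",
    "why do we dream a short history of sleep research",
]
question = "why do i forget things when i sleep badly"
print(hyde_search(question, corpus, fake_llm))
```

Searching with the raw question in this toy setup would favor the second document, which shares question words ("why", "do", "sleep") but not the answer's vocabulary. The hypothetical answer shifts the match toward the first.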
Multi-query retrieval generates several reformulations of the original query, runs each through the retriever independently, and merges the result lists with reciprocal rank fusion. The intuition is that different phrasings activate different parts of the index: a query about "impact of sleep deprivation on memory" might miss "Cognitive Effects of Insufficient Rest," but a variant phrased the second way will catch it. Three to five variants is the usual sweet spot; the cost is linear in the number of variants.
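The merge step can be sketched as follows. The ranked lists are hypothetical document ids, and `k = 60` is the constant conventionally used with reciprocal rank fusion; treat both as assumptions to tune for a real workload.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical ranked doc ids returned for three phrasings of one query.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d4", "d1"],
    ["d5", "d2", "d1"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)
```

Fusion works on ranks rather than raw scores, so lists from retrievers with incomparable score scales can be merged without calibration. Documents that appear in several lists (here d2 and d1) rise above documents that top only one.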
Query expansion is the lightest-weight variant in the family. Instead of generating full reformulations, append related terms to the original query: synonyms, acronyms, plurals, or domain-specific vocabulary. For sparse retrieval specifically, expansion is powerful because it directly addresses the vocabulary mismatch problem; BM25 cannot match a document about "chief executive officer remuneration" against a query for "CEO pay" unless something connects the vocabularies. Expansion techniques go back to classical IR (relevance feedback, thesaurus-based expansion) and predate the modern LLM era by decades.
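A minimal sketch of dictionary-based expansion, using the example from the paragraph above. The synonym table is hand-written and hypothetical; a real system might derive it from a thesaurus, query logs, or an LLM.

```python
# Hypothetical hand-written synonym table (assumption for illustration).
SYNONYMS = {
    "ceo": ["chief executive officer"],
    "pay": ["remuneration", "compensation"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> str:
    """Append related terms for each query token, keeping the original terms first."""
    tokens = query.lower().split()
    extra = [alt for tok in tokens for alt in synonyms.get(tok, [])]
    return " ".join(tokens + extra)

expanded = expand_query("CEO pay")
print(expanded)

# The expanded query now shares terms with the document from the example.
doc_terms = set("chief executive officer remuneration".split())
print(sorted(doc_terms & set(expanded.split())))
```

Before expansion the query and the document share no tokens, so a lexical scorer has nothing to match on; after expansion all four document terms are present in the query.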
Step-back prompting takes the opposite approach.43 Instead of rephrasing or expanding the original question, ask the LLM to abstract a more general "step-back" question from the specific one, then search with both. The original query "what is the degradation rate of lithium iron phosphate batteries at 45 degrees Celsius?" retrieves narrow technical documents; the step-back query "what factors affect lithium-ion battery degradation?" retrieves the foundational documents that contextualize the specific number. The union of the two result sets gives the LLM both the precise data and the surrounding context an answer typically needs. The technique works particularly well when the user's query uses vocabulary that does not match the corpus's vocabulary, and the cost is similar to HyDE (one extra LLM call) but the prompt is simpler because the LLM is abstracting rather than fabricating.
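The search-with-both step can be sketched as below. The retriever and the abstraction call are hypothetical stand-ins backed by a small lookup table, and interleaving the two result lists is one possible way to form the union, chosen here for illustration.

```python
from typing import Callable

def step_back_search(question: str, search: Callable[[str], list[str]],
                     abstract: Callable[[str], str]) -> list[str]:
    """Search with the original and the step-back question; return the ordered union."""
    specific = search(question)
    general = search(abstract(question))
    merged: list[str] = []
    # Interleave so both the precise and the contextual documents surface early.
    for i in range(max(len(specific), len(general))):
        for results in (specific, general):
            if i < len(results) and results[i] not in merged:
                merged.append(results[i])
    return merged

# Hypothetical stand-ins for the LLM abstraction step and the retriever.
def fake_abstract(question: str) -> str:
    return "what factors affect lithium-ion battery degradation?"

FAKE_INDEX = {
    "what is the degradation rate of lithium iron phosphate batteries "
    "at 45 degrees celsius?": ["lfp_cycle_test_45c", "lfp_datasheet"],
    "what factors affect lithium-ion battery degradation?":
        ["degradation_overview", "lfp_datasheet"],
}

def fake_search(query: str) -> list[str]:
    return FAKE_INDEX.get(query.lower(), [])

docs = step_back_search(
    "What is the degradation rate of lithium iron phosphate batteries "
    "at 45 degrees Celsius?",
    fake_search, fake_abstract,
)
print(docs)
```

The merged list carries the narrow test report, the general overview, and the datasheet that both queries found, with the duplicate removed.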
Common failure modes the metrics catch (and the ones they do not)
Even well-designed retrieval pipelines fail in predictable ways. Four patterns explain a disproportionate share of production incidents, and recognizing them is the difference between debugging in minutes and debugging in days.
Stale embeddings. A document is updated in the source-of-truth system but the pipeline that re-embeds and re-indexes the changed document does not run, or runs late. The vector index still contains the old embedding, which may no longer match queries about the updated content. The failure is operationally trivial, yet it is a recurring cause of production incidents. Every document-update pipeline needs an embedding-refresh hook, ideally one that fails loudly when it misses.
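One way to make such a hook fail loudly is to store a content hash next to each vector and compare it against the source on a schedule. The sketch below assumes that design; the function names and the metadata layout are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids whose current content no longer matches the hash stored at embedding time."""
    return sorted(doc_id for doc_id, text in source_docs.items()
                  if indexed_hashes.get(doc_id) != content_hash(text))

def assert_index_fresh(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> None:
    """Fail loudly instead of silently serving old vectors."""
    stale = find_stale(source_docs, indexed_hashes)
    if stale:
        raise RuntimeError(f"{len(stale)} document(s) need re-embedding: {stale}")

# Hypothetical state: doc "b" was edited after it was embedded.
source = {"a": "refund window is 30 days", "b": "refund window is 60 days"}
index_meta = {
    "a": content_hash("refund window is 30 days"),
    "b": content_hash("refund window is 30 days"),
}
print(find_stale(source, index_meta))
```

Documents missing from the index metadata are also reported as stale, because `indexed_hashes.get` returns `None` for them. That covers the case where a new document was never embedded at all.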
Chunk-boundary artifacts. The answer to a query spans two chunks, but only one is retrieved. The LLM receives half the answer and either confabulates the rest or produces a plausible but wrong response. Overlapping chunks help (the overlap setting most chunkers expose, often on the order of 100 tokens, is exactly this fix), and parent-document retrieval, where the retrieved chunk is replaced or augmented by its containing section, helps further.
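A sliding-window chunker with overlap can be sketched in a few lines. The sizes below are tiny so the overlap is visible; the function name and parameters are illustrative rather than any particular library's API.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last two tokens of its predecessor, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is no longer than the overlap. The price is a larger index: with 50% overlap the chunk count roughly doubles.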
Embedding-model drift. An embedding model is upgraded for better quality, but the existing vector index was built with the old model, and embedding spaces are not interoperable across model versions. Old vectors and new queries no longer share a coordinate system, so similarity scores become meaningless. The fix is to re-embed the entire corpus when the model changes, which is obvious in principle but easy to overlook in practice. The MTEB loop catches this in advance only if the team re-runs the loop after each model upgrade; the BM25 loop is immune because there is no embedding to drift.
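A cheap guard against this failure is to record which model built the index and refuse to search when the query encoder differs. The sketch assumes such metadata is stored alongside the vectors; the schema and model names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexMetadata:
    """Stored next to the vectors when the index is built (hypothetical schema)."""
    model_name: str
    dimension: int

def check_compatible(index: IndexMetadata, query_model: str, query_dim: int) -> None:
    """Refuse to search when the query encoder differs from the one that built the index."""
    if index.model_name != query_model or index.dimension != query_dim:
        raise ValueError(
            f"index built with {index.model_name} ({index.dimension}d) but queries use "
            f"{query_model} ({query_dim}d); re-embed the corpus before searching"
        )

index = IndexMetadata(model_name="embedder-v1", dimension=768)
check_compatible(index, "embedder-v1", 768)      # passes silently
try:
    check_compatible(index, "embedder-v2", 768)  # same dimension, different space
except ValueError as err:
    print(err)
```

Checking the model name and not only the dimension matters, because two model versions can emit vectors of the same length that live in unrelated spaces; a dimension check alone would let that case through.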
Score-threshold traps. Some systems filter results by a minimum similarity score (e.g., "return only documents with cosine similarity above 0.7"). This seems reasonable but fails for queries that are genuinely difficult, returning nothing when returning the best available document (even if imperfect) would have been more useful. Prefer rank-based cutoffs (top-K) over score-based cutoffs unless there is a strong reason to use a threshold, because the score distribution shifts with the query and the threshold that works on the training distribution rarely transfers cleanly to production traffic.
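The difference between the two cutoffs shows up as soon as a hard query arrives. The similarity values below are invented for illustration.

```python
def by_threshold(scored: list[tuple[str, float]], min_score: float) -> list[str]:
    """Keep only documents whose score clears a fixed bar."""
    return [doc for doc, score in scored if score >= min_score]

def by_top_k(scored: list[tuple[str, float]], k: int) -> list[str]:
    """Keep the k best documents regardless of their absolute scores."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Hypothetical cosine similarities for an easy and a hard query.
easy = [("d1", 0.86), ("d2", 0.74), ("d3", 0.41)]
hard = [("d7", 0.58), ("d8", 0.52), ("d9", 0.33)]

print(by_threshold(easy, 0.7), by_top_k(easy, 2))
print(by_threshold(hard, 0.7), by_top_k(hard, 2))
```

On the easy query both cutoffs agree. On the hard query the 0.7 threshold returns an empty list while top-2 still hands the best available candidates to the next stage, where a reranker or the LLM can judge whether they are usable.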
The language model at the end of the pipeline is only as good as the context it receives. A team can swap in a more powerful model, refine the prompt template, and tune generation parameters, but none of that matters if the retriever surfaced the wrong documents. The discipline this section lays out (reranking, query transformation, failure-mode awareness) is what closes the gap between "the retriever ran" and "the retriever did its job." Measurement tells you whether the gap exists. The levers above are how you close it.
The Discipline Behind the Numbers
The two loops the article walked, the closed BM25 loop and the open-coordinate-system MTEB loop, are not in competition. They cover different retrievers and they require different infrastructure, and a serious retrieval system in 2026 typically runs both at different stages of its lifecycle. The BM25 loop closes inside your own engineering and is the right discipline whenever the retriever is BM25, or whenever you are evaluating a hybrid system on a slice you control. The MTEB loop is what makes the embedding-model component of the same system comparable across vendors, which is the only way a swap decision can be more than a hunch.
What unifies the two loops is the underlying claim the article opened with: measurement is the precondition for improvement, and the discipline of measurement matters more than the specific tool. A field with a flawed but quantitative measurement procedure is a field where claims about retriever quality can be checked, contested, and replaced with better claims. A field without a quantitative measurement procedure is a field where the comparison of two retrievers is a matter of who has the louder marketing department. BM25 evaluation has decades of academic infrastructure behind it; MTEB has three years and is still maturing. Both are imperfect; both make the next improvement legible in a way nothing else does.
None of the critiques in the MTEB sections are arguments against using MTEB. They are arguments for using it carefully. The maintainers themselves document the reproducibility hazards. The successor benchmarks (BRIGHT, RAR-b, LongEmbed, AIR-Bench, MMTEB) exist because the field is healthy enough to keep probing the gaps. The dimension-ceiling result from Weller et al. is a foundational result about single-vector embeddings that MTEB itself does not test for, but that the practice of standardized comparison helped make quantifiable in the first place.
MTEB is neither perfect nor finished, but it is what the field has agreed to push against, and pushing against it has produced more honest embedding models than any alternative process the field has tried. Selecting an embedding model on the basis of its MTEB score, then verifying that score against your own workload (using the same evaluation infrastructure that closed the BM25 loop above), is engineering. Doing it without either loop, on either retriever, is taste. The retrieval system that gets shipped is only as good as the measurement loop that lets the team know whether they shipped an improvement.
References
- Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). "MTEB: Massive Text Embedding Benchmark." Proceedings of EACL 2023. Anchor paper. Defines the original eight tasks, 58 datasets, 112 languages, and 33-model evaluation.
- embeddings-benchmark. (2026). "MTEB GitHub repository." Source, README, task registry, and evaluation runner.
- embeddings-benchmark. (2026). "MTEB documentation landing page." Current task and language counts (1,000+ tasks, 1,000+ languages, image and text modalities).
- embeddings-benchmark. (2026). "MTEB task registry source: nq_retrieval.py." Concrete example of how a task is registered as a Python class with TaskMetadata.
- embeddings-benchmark. (2026). "MTEB README · Example Usage." Python and CLI invocation patterns.
- embeddings-benchmark. (2026). "Defining a Custom Model." The EncoderProtocol contract for wrapping a custom embedding model.
- Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark." ICLR 2025. Community-driven expansion to 10 tasks, 400+ datasets, 250+ languages. Documents the two-day A100-GPU cost of a full English MTEB pass.
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021 Datasets and Benchmarks. The retrieval suite MTEB inherits.
- Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." NIPS 2016 Cognitive Computation Workshop.
- Kwiatkowski, T., et al. (2019). "Natural Questions: A Benchmark for Question Answering Research." Transactions of the ACL.
- Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." EMNLP 2018.
- Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). "FEVER: a Large-scale Dataset for Fact Extraction and VERification." NAACL-HLT 2018.
- Voorhees, E., et al. (2020). "TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection." SIGIR Forum.
- Wadden, D., et al. (2020). "Fact or Fiction: Verifying Scientific Claims (SciFact)." EMNLP 2020.
- Maia, M., et al. (2018). "FiQA 2018: Financial Opinion Mining and Question Answering." WWW 2018 shared task.
- Wachsmuth, H., Syed, S., & Stein, B. (2018). "Retrieval of the Best Counterargument without Prior Topic Knowledge (ArguAna)." ACL 2018.
- Diggelmann, T., Boyd-Graber, J., Bulian, J., Ciaramita, M., & Leippold, M. (2020). "CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims." NeurIPS 2020 Workshop.
- Cohan, A., Feldman, S., Beltagy, I., Downey, D., & Weld, D. S. (2020). "SPECTER and SciDocs: Document-level Representation Learning." ACL 2020.
- Hasibi, F., et al. (2017). "DBpedia-Entity v2: A Test Collection for Entity Search." SIGIR 2017.
- Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., & Nie, J.-Y. (2024). "C-Pack: Packed Resources For General Chinese Embeddings." SIGIR 2024. Introduces C-MTEB and the BGE family.
- Ciancone, M., Kerboua, I., Schaeffer, M., & Siblini, W. (2024). "MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis." arXiv.
- Poświata, R., Dadas, S., & Perełkiewicz, M. (2026). "PL-MTEB: Polish Massive Text Embedding Benchmark." ACL 2026 Findings.
- Wehrli, S., Arnrich, B., & Irrgang, C. (2024). "German Text Embedding Clustering Benchmark." arXiv.
- Li, X., et al. (2025). "CoIR: A Comprehensive Benchmark for Code Information Retrieval Models." ACL 2025 Main.
- Zhu, D., et al. (2024). "LongEmbed: Extending Embedding Models for Long Context Retrieval." EMNLP 2024.
- Chen, J., et al. (2025). "AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark." ACL 2025 Main.
- Su, H., et al. (2024). "BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval." 1,384 reasoning-intensive queries; SFR-Embedding-Mistral scores 59.0 on MTEB and 18.3 on BRIGHT.
- Xiao, C., Hudson, G. T., & Al Moubayed, N. (2024). "RAR-b: Reasoning as Retrieval Benchmark." arXiv.
- Chung, I., Kerboua, I., Kardos, M., Solomatin, R., & Enevoldsen, K. (2025). "Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks." arXiv. The maintainers' own account of reproducibility hazards.
- Weller, O., Boratko, M., Naim, I., & Lee, J. (2026). "On the Theoretical Limitations of Embedding-Based Retrieval." ICLR 2026. Proves a dimension-based combinatorial ceiling on single-vector retrievers.
- Frank, M., & Afli, H. (2026). "PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs." EACL 2026 Main.
- Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., & Wei, F. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)." Microsoft arXiv.
- Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., & Zhang, M. (2023). "Towards General Text Embeddings with Multi-stage Contrastive Learning (GTE)." Alibaba DAMO arXiv.
- Nussbaum, Z., Morris, J. X., Duderstadt, B., & Mulyar, A. (2024). "Nomic Embed: Training a Reproducible Long Context Text Embedder." Nomic AI arXiv.
- Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., & Ping, W. (2024). "NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models." NVIDIA arXiv.
- Merrick, L., Xu, D., Nuti, G., & Campos, D. (2024). "Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models." Snowflake arXiv.
- Zhang, D., Li, J., Zeng, Z., & Wang, F. (2024). "Jasper and Stella: Distillation of SOTA Embedding Models." arXiv.
- Günther, M., et al. (2023). "Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents." Jina AI arXiv.
- Elastic. (2024). "Elastic ELSER: Retrieval ranking on Hugging Face MTEB Leaderboard." Elasticsearch Labs. ELSER v2 ranks in the top-10 of MTEB Retrieval when competitor flavors are grouped, evidence that sparse-neural encoders are first-class citizens of MTEB.
- Ostendorff, M. (2024). "Run BM25 baseline on MTEB retrieval tasks." GitHub gist. BM25 baselines for MTEB retrieval tasks computed via separate tooling, illustrating that BM25 does not fit the MTEB EncoderProtocol natively.
- Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv. The reference paper for cross-encoder reranking, which demonstrated substantial precision gains over first-stage retrieval on standard IR benchmarks and established the two-stage pattern as a production default.
- Gao, L., Ma, X., Lin, J., & Callan, J. (2022). "Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)." arXiv. Introduces hypothetical document embeddings: instead of searching with the query, search with an LLM-generated hypothetical answer that better matches the document distribution.
- Zheng, H. S., et al. (2023). "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." Google DeepMind. Introduces step-back prompting: generate a broader question that contextualizes the specific one, then search with both. Strongest for technical or domain-specific queries where the user's vocabulary may not match the corpus's.