
The Simplest Possible RAG

Most RAG tutorials open with a vector database. This one does not. The simplest production RAG is classic Elasticsearch keyword search feeding top-k passages into a single LLM call, and for most corpora it is the right starting point.

In Brief

The simplest production RAG that does useful work is a BM25 retriever and one LLM call. No vector database. No reranker. Three Python functions, roughly forty lines, and a measurable improvement over the LLM answering on its own. The reason this matters is empirical: Anthropic's own Contextual Retrieval evaluation, the one most often cited as evidence that you need embeddings to do RAG well, keeps BM25 as a core component of its best-performing configurations. Contextual embeddings, contextual BM25, and a reranker cut the retrieval failure rate by 35 percent, 49 percent, and 67 percent cumulatively in their numbers, and the two strongest configurations run BM25 alongside embeddings, not embeddings in place of it. The lexical retriever is the foundation everything else gets stacked on.

This article shows the foundation itself: three Python functions, the latency profile to expect (sub-100 ms for retrieval, one to seven seconds for the LLM call depending on the model), the 2026 cost ledger (Sonnet 4.6 at three dollars per million input tokens, Haiku 4.5 at one dollar, cached input at a tenth of those rates), and the working argument for why this should be the default starting point. The companion simple-rag-walkthrough demo is embedded so the request path is visible end to end. If your corpus does not yet have a measured failure rate, the lesson is to build this version, measure it, and only add complexity where the measurement says the complexity will pay back.

The 2020 paper that introduced RAG describes a two-part architecture. Lewis, Perez, Piktus, Petroni and co-authors call it "parametric and non-parametric memory for language generation." The parametric memory is what the language model learned during pre-training. The non-parametric memory is an external index that the model consults at inference time. Their paper's instantiation used a dense vector index of Wikipedia. The architectural definition does not require it.

The retriever is an abstract module. Anything that can rank documents by relevance to a query satisfies the role: a vector store, an inverted index, a knowledge graph, a SQL database, a literal grep. Every subsequent paper that calls itself RAG uses some realization of this module, and the choice of realization is engineering taste, not definition. Issue #117 works through the definitional argument in detail; the short version is that Elasticsearch with BM25 satisfies the retriever role, and therefore Elasticsearch + top-k + LLM is RAG.

This article shows what the simplest realization actually looks like. Elasticsearch's BM25 keyword retriever in front of an Anthropic Claude or OpenAI GPT call. No embeddings, no vector database, no reranker. The retriever is the same kind powering Stack Overflow's question search, the same kind running underneath IBM Watson Discovery, and the same family of ranking functions that served as the standard strong baseline at TREC for decades before neural retrievers arrived.

The companion demo plays the pattern end to end. A query types itself in, ten BM25-ranked results stagger onto the page, the top three highlight as "context for the LLM" and stack into a Context Window panel, an LLM panel pulses, and a synthesized answer streams into the panel below it. The whole sequence has no semantic embedding anywhere: BM25 does the retrieval, and a single API call does the synthesis.

Elasticsearch top-3 retrieval on the query "stuck pipe with wellbore instability," feeding an LLM synthesis layer.

The argument of this article is that the simplest realization should be the default, not a fallback. Three things make it worth taking seriously as a production pattern: BM25 retrieval is fast, with published benchmarks showing top-k queries against hundreds of millions of postings completing in roughly a hundred milliseconds and smaller indexes in single-digit milliseconds; LLM synthesis is the expensive part, so the retriever only needs to be "good enough that the model can answer"; and the complexity tax of adding vector retrieval is severe, in ways that issue #114 documents in operational detail.

Private note

This is the angle. The mainstream framing equates RAG with vector DB, the original paper does not, and the simplest realization works. Adding complexity is something you do AFTER you measure that the simplest realization is failing on your data. The article should be calm about this, not zealous, just empirically grounded. Anthropic's own Contextual Retrieval paper later in this piece is the strongest possible refutation of vector-first thinking, because it comes from a frontier lab whose interests would naturally push the other way.

. . .

Long Context Is Not a Substitute

A reasonable objection to RAG in 2026 is "why bother retrieving when Claude Opus 4.7 has a 1M-token context window?" Just dump the corpus into the prompt and let the model find what it needs.

Anthropic answers this in their own documentation. From the Claude context-windows page: "As token count grows, accuracy and recall degrade, a phenomenon known as context rot. This makes curating what's in context just as important as how much space is available." The 1M window is not free. It degrades the model's ability to find the relevant passage within all that text. The longer the context, the more aggressively the model's attention has to compete with itself.

Retrieval is curation. RAG with a small top-k preempts both failure modes that long-context-only systems exhibit: either the model never sees the relevant passage (if you didn't include enough of the corpus) or it has to find the passage in a haystack (if you included too much). A precise top-3 of relevant passages plus a focused prompt is typically more accurate than a 1M-token context dump, even on a model that can technically accept the dump.

Issue #112 works through the RAG-vs-long-context tradeoff in detail. The summary: long context complements retrieval; it does not replace it. The 1M token window is most useful when you want to include a few large coherent documents (a whole contract, a whole codebase) rather than thousands of short passages.

. . .

How a BM25 Score Gets Produced

BM25 has been Elasticsearch's default scoring function since version 5.0, which adopted Lucene 6 in 2016. The algorithm itself dates to the 1994 "Okapi at TREC-3" paper by Robertson, Walker, Jones, Hancock-Beaulieu, and Gatford. The Lucene BM25Similarity class documents the implementation directly; default parameters are k1 = 1.2 and b = 0.75.

The formula is a probabilistic reweighting of term frequencies, with two corrections. Term frequency saturates, so doubling a query term's occurrences in a document does not double its contribution; the saturation curve is controlled by k1. Document length is normalized, so a long document with the same term count as a short one receives a lower score; the normalization weight is controlled by b. The IDF component, computed as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)), down-weights terms that appear in many documents. A term in 100% of documents contributes almost nothing. A term in 0.1% of documents contributes a lot.

The result is a scoring function that prefers documents where the query terms are rare in the corpus but dense in this specific document. For a query like "stuck pipe with wellbore instability" against a drilling-operations corpus, BM25 surfaces documents about wellbore instability mechanisms ahead of documents about generic drilling operations, because "wellbore instability" is both rare in the corpus as a whole and dense in those specific documents.
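To make the formula concrete, here is a minimal sketch of the per-term score in plain Python. It uses the IDF expression quoted above together with the textbook BM25 term-frequency component. The function names are illustrative, and the constant (k1 + 1) factor is an assumption taken from the textbook form; some implementations omit it because it does not change the ranking.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF as quoted in the text: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(
    tf: int,
    doc_len: int,
    avg_doc_len: float,
    doc_count: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Contribution of one query term to one document's score (textbook form)."""
    # b controls how strongly documents longer than average are penalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    saturation = tf * (k1 + 1) / (tf + k1 * length_norm)
    return bm25_idf(doc_count, doc_freq) * saturation


# A rare term (10 of 100,000 docs) versus a common one (50,000 of 100,000 docs),
# each appearing twice in an average-length document.
rare = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, doc_count=100_000, doc_freq=10)
common = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, doc_count=100_000, doc_freq=50_000)
```

A document's full score for a query is the sum of this quantity over the query's terms, which is why a rare term that is dense in one document dominates the ranking.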

Private note

A common error in writing about BM25 for an LLM-savvy audience is to dismiss it as "just keyword matching." It isn't: the IDF component is sophisticated probabilistic reweighting, the saturation curve is empirically tuned, and decades of IR research went into the algorithm. Treating BM25 as obsolete because it does not use neural embeddings misses what BM25 actually does. The right framing is: BM25 is a strong baseline that vector search must beat to justify its complexity, and on plenty of corpora it does not.

Shane Connelly, who wrote Elastic's official BM25 explainer, frames the parameter behavior plainly. "If b is bigger, the effects of the length of the document compared to the average length are more amplified." "k1 is a variable which helps determine term frequency saturation characteristics. That is, it limits how much a single query term can affect the score of a given document." Tuning these parameters matters at the margin; for most corpora the defaults (k1 = 1.2, b = 0.75) are within a few percentage points of optimal.

. . .

The Query DSL

The simplest BM25 query in Elasticsearch is a match query against a text field.

{
  "query": {
    "match": {
      "content": "stuck pipe with wellbore instability"
    }
  }
}

This runs BM25 against the analyzed tokens of the content field. The query string is itself analyzed through whatever analyzer is configured for the field (the standard analyzer by default, or the English analyzer if you've opted in to stemming and stopword removal). Elasticsearch tokenizes, lowercases, optionally stems, then scores each candidate document by BM25 against the resulting token bag.

For multi-field search, the multi_match query lets you query several fields at once, optionally with per-field boosts.

{
  "query": {
    "multi_match": {
      "query": "stuck pipe with wellbore instability",
      "fields": ["title^2", "content"]
    }
  }
}

The ^2 boosts the title field's contribution by 2x. Titles are short and information-dense; this is a typical pattern. Elasticsearch supports several multi-match strategies, including best_fields (use the score from the best-matching field, the default) and cross_fields (treat all fields as one big field, useful for documents where the title and body share vocabulary).

For metadata filtering, the bool query combines a scored must clause with one or more filter clauses that constrain the result set without affecting scores.

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "stuck pipe with wellbore instability"
        }
      },
      "filter": [
        {"range": {"published": {"gte": "2020-01-01"}}},
        {"term": {"doc_type": "technical_paper"}}
      ]
    }
  }
}

Filter-context clauses skip scoring and their results can be cached, and Query Quotient's published benchmarks put them at 2-5x faster than equivalent must-context queries. For RAG against a corpus with structured metadata (publication dates, document types, source restrictions, access controls), filters are how you constrain the corpus before BM25 ranks it.
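In application code the same bool query is usually assembled from request parameters. The sketch below builds the query body shown above as a Python dict. The helper name build_filtered_query and its arguments are hypothetical; the field names (content, published, doc_type) are the ones used in the example.

```python
def build_filtered_query(text: str, published_since: str, doc_type: str) -> dict:
    """Return a bool query: one BM25-scored match plus unscored filters."""
    return {
        "bool": {
            # Scored clause: contributes to _score via BM25.
            "must": {"match": {"content": text}},
            # Filter clauses: restrict the candidate set, no effect on _score.
            "filter": [
                {"range": {"published": {"gte": published_since}}},
                {"term": {"doc_type": doc_type}},
            ],
        }
    }


query_body = build_filtered_query(
    "stuck pipe with wellbore instability",
    published_since="2020-01-01",
    doc_type="technical_paper",
)
```

The resulting dict can be passed as the query argument of the Python client's search call, so filters can be added or dropped per request without touching the scored clause.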

Top-k is just the size parameter on the search request, defaulting to 10 per Elasticsearch's pagination documentation. The highlight clause asks Elasticsearch to return matched fragments alongside each hit. This is what gets passed to the LLM, not the whole document.

{
  "size": 5,
  "query": {
    "match": {"content": "stuck pipe with wellbore instability"}
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 200,
        "number_of_fragments": 2
      }
    }
  }
}

The fragment size and count are tunable. Note that fragment_size is measured in characters, not tokens. For RAG, two 200-character fragments per document is a reasonable default: enough to give the LLM the context around the matching terms, not so much that the prompt explodes.

. . .

Forty Lines of Python

The Python client is the official elasticsearch package, installable with pip install elasticsearch and connected to a local cluster started by Elastic's one-liner curl -fsSL https://elastic.co/start-local | sh.

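import os

from elasticsearch import Elasticsearch

# start-local secures the cluster; the key is assumed exported as ES_LOCAL_API_KEY.
es = Elasticsearch("http://localhost:9200", api_key=os.environ.get("ES_LOCAL_API_KEY"))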

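def search(query: str, k: int = 3) -> list[dict]:
    """Return the top-k BM25 hits for `query` as title/snippet/score dicts."""
    response = es.search(
        index="docs",
        query={"match": {"content": query}},  # BM25-scored full-text match
        highlight={"fields": {"content": {
            "fragment_size": 200,       # characters, not tokens
            "number_of_fragments": 1,   # one passage per hit
        }}},
        size=k,
    )
    return [
        {
            "title": hit["_source"]["title"],
            # First (and only) highlighted fragment for this hit.
            "snippet": hit["highlight"]["content"][0],
            "score": hit["_score"],
        }
        for hit in response["hits"]["hits"]
    ]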

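The function is one search call and one list comprehension. The highlight["content"][0] extraction takes the first returned fragment, which is the span of text Elasticsearch determined was most relevant. By default the matched terms in that fragment arrive wrapped in <em> tags.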
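One practical wrinkle: a hit can come back without a highlight entry (for example, if the query is later changed so that a document matches on a field other than content), and the default highlighter wraps matched terms in <em> tags. The following is a minimal sketch of a more defensive extraction step, under the assumption that the index stores a content field in _source. The helper name hits_to_passages is hypothetical.

```python
import re

# Default Elasticsearch highlight markers around matched terms.
_EM_TAG = re.compile(r"</?em>")


def hits_to_passages(response: dict, max_chars: int = 200) -> list[dict]:
    """Convert a search response into passages, tolerating missing highlights."""
    passages = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        fragments = hit.get("highlight", {}).get("content")
        if fragments:
            # Strip the <em> markers so they do not leak into the prompt.
            snippet = _EM_TAG.sub("", fragments[0])
        else:
            # Fall back to the start of the stored text.
            snippet = source.get("content", "")[:max_chars]
        passages.append(
            {"title": source["title"], "snippet": snippet, "score": hit["_score"]}
        )
    return passages
```

Whether to strip the markers is a design choice: some teams keep them so the model can see which terms matched.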

The LLM synthesis layer follows Anthropic's recommended pattern for RAG prompts: retrieved passages wrapped in <documents> XML, the user query at the bottom, and an explicit instruction to quote relevant passages before answering. Anthropic measures up to a 30% quality improvement on multi-document inputs when the query comes after the context and is structured this way.

import anthropic

client = anthropic.Anthropic()

def synthesize(query: str, passages: list[dict]) -> str:
    docs_xml = "\n".join(
        f'  <document index="{i+1}">\n'
        f'    <source>{p["title"]}</source>\n'
        f'    <document_content>{p["snippet"]}</document_content>\n'
        f'  </document>'
        for i, p in enumerate(passages)
    )
    prompt = f"""<documents>
{docs_xml}
</documents>

Find quotes from the documents that are relevant to answering the question.
Place these in <quotes> tags. Then synthesize an answer grounded in the
quoted passages. Place the answer in <answer> tags. If the documents do
not contain enough information to answer the question, say so explicitly
rather than inventing an answer.

Question: {query}
"""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.content[0].text

The quote-first instruction comes directly from Anthropic's prompt-engineering documentation: "For long document tasks, ask Claude to quote relevant parts of the documents first before carrying out its task. This helps Claude cut through the noise of the rest of the document's contents." The temperature=0 setting reduces variation across runs, which matters for evaluation and for production reproducibility.
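Because the prompt asks for <quotes> and <answer> tags, the raw response text contains both. A caller that only wants the final answer can pull it out with a small parser. This is a sketch only: the helper name extract_tag is hypothetical, and it assumes the model followed the tag instructions, returning None when it did not.

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the contents of the first <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


raw = "<quotes>Shale swelling narrows the hole.</quotes>\n<answer>\nRaise mud weight.\n</answer>"
answer = extract_tag(raw, "answer")
quotes = extract_tag(raw, "quotes")
```

Keeping the quotes alongside the answer is useful for evaluation, since it shows which retrieved passage the model actually relied on.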

End-to-end, the system is one more function.

def rag(query: str) -> str:
    passages = search(query, k=3)
    return synthesize(query, passages)

answer = rag("stuck pipe with wellbore instability")
print(answer)

That is the entire RAG system. The full pipeline runs in three functions and roughly forty lines of Python, with no vector database, no embedding model, no reranker, no framework, no orchestration library, and no infrastructure beyond Elasticsearch and an LLM API key.

Private note

Every time I see a "build RAG with LangChain plus Pinecone plus LlamaIndex plus a reranker" tutorial, I think about these forty lines. Most of the complexity in the typical RAG tutorial exists to solve problems the simplest approach does not have. Vector DBs solve a problem (vocabulary mismatch in retrieval) that may or may not be your problem. Rerankers solve a problem (top-k quality at small k) that may or may not be your problem. Start with the forty lines. Add what you measure to be necessary. The danger is not undertooling; the danger is overtooling before you have evidence.

The same shape exists in the major framework wrappers. Haystack's ElasticsearchBM25Retriever, LangChain's ElasticsearchRetriever with a body_func that returns a match query, LlamaIndex's ElasticsearchStore: all three can run the BM25-only pattern with about the same line count. Elastic's own published reference notebook, elasticsearch-labs/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb, builds a chatbot with multi_match plus a bool filter and nothing else.

. . .

Cost and Latency

For a single query against a 100,000-document corpus, the latency profile of the forty-line system breaks down predictably. Elasticsearch's BM25 retrieval is the cheap, fast step. The LLM call is the expensive, slow step. The cost arithmetic flips the usual question of whether RAG is "too expensive."

turbopuffer's published BM25 benchmarks give concrete millisecond figures across query workloads. Their table shows that latency scales sub-linearly with corpus size: a query against 859,959 postings completes in 1.0 ms; a query against 363 million postings completes in 107.7 ms. For a typical RAG corpus of 100K documents at 1,000 tokens each, BM25 retrieval is comfortably under 50 ms. Query Quotient's operational guide puts the realistic alerting thresholds at p95 above 500 ms (investigate) and p99 above 1 second (critical), which is the latency budget for the retrieval step.

The LLM call is what dominates. Independent benchmarking by Artificial Analysis measures Claude Sonnet 4.6 at 1.44 seconds time-to-first-token and 50.1 output tokens per second on Anthropic's API. A 200-token synthesized answer therefore takes roughly 5.4 seconds wall-clock from query to last token. For interactive use, streaming makes the perceived latency closer to TTFT than to completion, but the cost numbers are based on the completion.
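The wall-clock estimate is just time-to-first-token plus output length divided by throughput. A small sketch of that arithmetic, using the benchmark figures quoted above (the function name is illustrative):

```python
def completion_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Estimated seconds from request to last token for one streamed response."""
    return ttft_s + output_tokens / tokens_per_s


# Sonnet 4.6 figures from the text: 1.44 s TTFT, 50.1 output tokens per second.
sonnet_200 = completion_seconds(ttft_s=1.44, output_tokens=200, tokens_per_s=50.1)
print(f"{sonnet_200:.1f} s")  # 1.44 + 200 / 50.1, prints "5.4 s"
```

The same function makes it easy to see how the budget moves with answer length, which is usually a bigger lever on latency than anything in the retrieval step.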

Step	Typical Latency	Typical Cost (per query)
Elasticsearch BM25 (top-3, 100K docs)	5-50 ms	Fractions of a cent (server cost)
Claude Haiku 4.5 synthesis	~3-5 sec	~$0.0025 (1k in, 300 out)
Claude Sonnet 4.6 synthesis	~5-7 sec	~$0.0075 (1k in, 300 out)
Claude Opus 4.7 synthesis	~11 sec TTFT, then streaming	~$0.0125 (1k in, 300 out)

Anthropic's published pricing is $1 per million input tokens and $5 per million output for Haiku 4.5; $3 / $15 for Sonnet 4.6; $5 / $25 for Opus 4.7. Anthropic also offers cache-read pricing at 0.1x the standard input rate, which materially changes the economics for production workloads with stable system prompts: a system prompt that gets reused across many queries is charged once at the cache-write rate (1.25x the standard input rate for the default five-minute cache) and then at 10% of the standard rate for each subsequent query within the cache TTL.
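The per-query arithmetic is simple enough to keep in a helper and run against real token counts. The sketch below uses the per-million-token rates quoted above; the dictionary keys and function name are illustrative, and the one-time cache-write charge is deliberately left out, so treat it as a steady-state estimate.

```python
# USD per million tokens as (input, output), from the rates quoted in the text.
PRICES = {
    "haiku-4.5": (1.0, 5.0),
    "sonnet-4.6": (3.0, 15.0),
    "opus-4.7": (5.0, 25.0),
}
CACHE_READ_MULTIPLIER = 0.1  # cached input billed at a tenth of the input rate


def query_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Estimated USD cost of one call; cached tokens are a subset of input tokens."""
    in_rate, out_rate = PRICES[model]
    fresh_tokens = input_tokens - cached_input_tokens
    total = (
        fresh_tokens * in_rate
        + cached_input_tokens * in_rate * CACHE_READ_MULTIPLIER
        + output_tokens * out_rate
    )
    return total / 1_000_000


uncached = query_cost("sonnet-4.6", input_tokens=1000, output_tokens=300)
mostly_cached = query_cost(
    "sonnet-4.6", input_tokens=1000, output_tokens=300, cached_input_tokens=800
)
```

Running this over logged token counts, instead of guessing, is the quickest way to see whether input or output tokens dominate the bill.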

The cost lesson is that the retriever is not what costs money. The model is what costs money. Reducing tokens-in by precise retrieval (top-k = 3 with focused fragments instead of dumping the corpus) is the largest cost lever. A team that builds a vector pipeline to "improve retrieval" without first measuring whether retrieval is the bottleneck is optimizing the wrong end of the latency-cost equation.

Private note

The cost arithmetic flips an instinctive reaction. People look at LLM token pricing and think "expensive." Then they spend engineering time on retrieval optimization without measuring whether retrieval was ever the problem. The right first move is to instrument the existing system, count tokens-in and tokens-out per query, multiply by the published rates, and figure out where the spend actually is. Almost always it is the model, not the retriever. Better retrieval reduces tokens-in, which is the largest direct lever. Vector retrieval improves recall at the same k, which is a smaller lever than reducing k.

. . .

What BM25 Will Not Do

BM25 has a well-documented weakness: it cannot match paraphrases. A query "how do I cancel my subscription" against a corpus where the relevant passage says "terminate billing arrangement" will miss because BM25 has no concept that "cancel subscription" and "terminate billing" are semantically equivalent. The two strings share no surface-form tokens after stemming and stopword removal; BM25 cannot connect them.
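The failure is easy to reproduce without a search engine. The sketch below tokenizes both strings, drops a small illustrative stopword list (an assumption, not Elasticsearch's actual list, and with no stemming applied), and intersects the remaining terms.

```python
import re

# Illustrative stopword subset only; a real analyzer uses its own list.
STOPWORDS = {"how", "do", "i", "my", "the", "a", "to"}


def content_terms(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


query_terms = content_terms("how do I cancel my subscription")
passage_terms = content_terms("terminate billing arrangement")
shared_terms = query_terms & passage_terms  # empty: nothing for BM25 to score
```

With no shared terms, every per-term BM25 contribution for that passage is zero, so the passage cannot enter the ranking no matter how the parameters are tuned.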

Embeddings solve this problem. A dense vector model trained on natural-language pairs learns that "cancel subscription" and "terminate billing" live near each other in vector space. Vector retrieval finds the passage by semantic similarity rather than by surface match. This is the principal capability that dense retrieval adds over BM25, and it matters for any corpus where user queries and document authors do not share vocabulary.

Anthropic's Contextual Retrieval announcement frames the trade-off directly. From their published blog post: "While embedding models excel at capturing semantic relationships, they can miss crucial exact matches." Embeddings solve the paraphrase problem; BM25 solves the exact-match problem. Anthropic's recommendation is not to choose one or the other but to use both.

The Contextual Retrieval announcement's numbers:

Contextual Embeddings reduced the top-20-chunk retrieval failure rate by 35% (5.7% → 3.7%).
Combining Contextual Embeddings and Contextual BM25 reduced the failure rate by 49% (5.7% → 2.9%).
Adding a reranker on top reduced the failure rate by 67% (5.7% → 1.9%).

The pattern is the load-bearing detail. Each step in their recommended stack adds incremental quality, and BM25 is part of the two best-performing configurations. Anthropic's published production RAG recipe is BM25 plus embeddings plus reranker, not embeddings alone. The base case is a conventional embedding-only pipeline, and its 5.7% failure rate is what the rest of the stack is trying to drive down.
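The percentages are relative reductions against the 5.7% baseline, not percentage-point drops. A quick check of the arithmetic, using only the failure rates quoted above (the function name is illustrative):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline failure rate that was eliminated."""
    return (baseline - improved) / baseline


BASELINE = 5.7  # top-20-chunk retrieval failure rate, in percent
steps = [
    ("contextual embeddings", 3.7),
    ("plus contextual BM25", 2.9),
    ("plus reranker", 1.9),
]
for label, failure_rate in steps:
    print(f"{label}: {relative_reduction(BASELINE, failure_rate):.0%}")
# prints 35%, 49%, 67% in that order
```

In absolute terms the full stack moves the failure rate by 3.8 percentage points, which is the number to weigh against the added operational cost.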

Private note

The Anthropic Contextual Retrieval numbers are the most important data point in this whole article. They are publishing a state-of-the-art recipe and that recipe USES BM25. Not "you could use BM25." Not "BM25 is one option." It uses BM25. As a core component. This is the strongest possible refutation of the "vector DB or nothing" narrative, coming from Anthropic itself. If a frontier lab whose business interests would naturally push toward "use our embeddings" instead publishes a recipe that says "use BM25 alongside our embeddings," the argument for vector-first thinking is over.

The honest accounting is that BM25 alone is not the highest-quality retrieval available. It is the floor that all the more elaborate methods build on. The question for a team building RAG today is not "BM25 or vectors?" but "how far do I need to climb from the BM25 floor?"

. . .

When BM25 Alone Is Enough

There are real-world corpora and workloads where the floor is sufficient. The marginal value of adding embeddings depends on how often your users' queries fail to share vocabulary with your documents, and that depends on the corpus.

Case	Why BM25 is enough
Literal queries (error codes, API names, SKUs, IDs)	Exact match beats semantic similarity. "Error E0402" finds "E0402" via BM25 instantly; a vector search returns approximate near-matches that may not include the exact code.
Internal documentation where queries and authors share vocabulary	Vocabulary mismatch is rare. Engineers asking about the "Foo service deployment pipeline" are searching docs written by other engineers who use the same terms.
Small, focused corpora (under 100K documents)	The complexity of vector indexing is unnecessary at this scale. BM25 with a good analyzer covers the surface.
Cost or latency constraints	No embedding step at index or query time, no second index, and no ANN library, for roughly half the operational cost of a hybrid stack.
Explainability requirements	BM25 scores trace to specific term matches in specific documents. Vector scores are opaque distances in a high-dimensional space. For regulated industries, BM25 is auditable in a way vector retrieval is not.
Multilingual without re-indexing	Elasticsearch's language analyzers cover 36 languages out of the box. Each language gets its own stemming and stopword handling. A vector pipeline would need an embedding model swap or a multilingual embedder.

A practitioner post that landed last December puts the choice plainly. "Start with BM25. Prove it's not enough with real queries. Add vector search surgically, where it fills a clear, measured gap." This is the operationally sound default. Build the BM25 system, deploy, measure where it fails, and add complexity in proportion to the measured failure.

Most teams will not progress past the BM25 system. The corpora where vector retrieval substantially outperforms BM25 are corpora where users ask questions in language meaningfully different from the authors who wrote the documents. That gap is real for some workloads (customer support, where users describe symptoms and docs describe causes) and largely absent for others (technical documentation, where both authors and queriers use the same jargon).

. . .

If You Need More

When BM25 alone proves insufficient on your data, Elasticsearch has a first-class hybrid retrieval mechanism that does not require leaving the platform. The rrf retriever combines multiple sub-retrievers (BM25, dense kNN, semantic, sparse) via Reciprocal Rank Fusion, a method from Cormack, Clarke, and Büttcher's 2009 SIGIR paper.

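The mechanism is simple. RRF assigns each document a fused score across multiple result lists: score(D) = Σ 1 / (k + rank(D, list_i)). The constant k (rank_constant in Elasticsearch, not to be confused with top-k) controls how quickly a document's contribution falls off with rank: a larger k flattens the difference between high and low positions, giving lower-ranked documents more influence. The fusion has no learned parameters; Elastic's documentation says "RRF requires no tuning."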

{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {"content": "stuck pipe with wellbore instability"}
            }
          }
        },
        {
          "knn": {
            "field": "content_vector",
            "query_vector": [0.123, -0.456, ...],
            "k": 50,
            "num_candidates": 100
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  }
}

The application code does not change. es.search(index="docs", retriever={"rrf": {...}}) returns the same response shape as a plain BM25 query. The retriever swap is transparent to the synthesis layer. If you started with BM25 and want to add embeddings later, you add an embedding pipeline at index time, add the content_vector field to your mapping, and change the search call to use the rrf retriever. The forty-line system becomes a fifty-line system.
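To make the fusion formula concrete, here is a minimal client-side sketch of Reciprocal Rank Fusion over two ranked lists of document IDs. Elasticsearch performs this server-side, so this is for illustration only; the function name and the document IDs are hypothetical, and the default of 60 is the constant commonly used for RRF (the JSON example above sets 20).

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], rank_constant: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(D) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; documents missing from a list add nothing for it.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
knn_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking], rank_constant=20)
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists (doc_a and doc_b here) rise to the top because they collect a contribution from each list, which is the whole point of the fusion: agreement between retrievers counts for more than a high rank in one of them.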

The next step up is reranking. A cross-encoder reranker reorders the top-k from the fusion based on a deeper similarity model. Anthropic's Contextual Retrieval numbers (49% → 67% failure-rate reduction) come from adding reranking on top of hybrid retrieval. Reranking is a real quality lever, but it adds another model in the loop and another point of latency. Pamela Fox's worked Azure example shows the pattern in production: BM25 plus vectors plus RRF plus a semantic reranker, where the reranker is the final accuracy gate before the LLM sees the passages.

. . .

What to Build First

If you are building RAG today and do not already have strong evidence about your retrieval failure modes, the prescription is straightforward.

1. Start with Elasticsearch's BM25 retriever. Use the english analyzer for English-language corpora, the appropriate language analyzer for others. Single content field with the text mapping type.
2. Index your documents with reasonable chunking. Most production guidance lands in the 200-1000 token range per chunk; tune by corpus.
3. Search with top-k = 3 to top-k = 10 depending on context budget and the size of your retrieved fragments.
4. Pass retrieved passages to Claude or GPT inside <documents> XML wrapping, with the query at the bottom and a quote-first grounding instruction.
5. Deploy.
6. Measure. Where does retrieval fail? Catalog real queries that return the wrong top-k. If the failures cluster around paraphrase or vocabulary mismatch, add an embedding index alongside BM25 and switch the retriever to rrf. If the failures cluster around ordering (right documents in top-50 but wrong order in top-3), add a reranker on top.
7. Re-measure.

Most teams will not need to go past step 5. The most common production mistake is skipping straight to a vector pipeline before there is any evidence that BM25 alone fails. The Anthropic Contextual Retrieval numbers tell you what each component buys: 35% (contextual embeddings alone), 49% (contextual embeddings plus contextual BM25), 67% (both plus a reranker). Each step adds quality and complexity. Buy them in order, only when needed.
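The measurement step can start as a very small harness: a set of real queries labeled with the document IDs that should come back, and a count of how often none of them appears in the top-k. This is a sketch under stated assumptions. The function name is hypothetical, and search_fn stands in for an adapter around your retriever that returns document IDs, which the search function shown earlier would need a small change to provide.

```python
from typing import Callable


def retrieval_failure_rate(
    labeled_queries: dict[str, set[str]],
    search_fn: Callable[[str, int], list[str]],
    k: int = 3,
) -> tuple[float, list[str]]:
    """Return (share of queries with no relevant doc in top-k, the failing queries)."""
    failures = []
    for query, relevant_ids in labeled_queries.items():
        retrieved_ids = set(search_fn(query, k))
        # A query fails when the top-k contains none of its relevant documents.
        if not retrieved_ids & relevant_ids:
            failures.append(query)
    rate = len(failures) / len(labeled_queries) if labeled_queries else 0.0
    return rate, failures


# Stand-in retriever for illustration: always returns the same three IDs.
def fake_search(query: str, k: int) -> list[str]:
    return ["d1", "d2", "d3"][:k]


labeled = {"stuck pipe causes": {"d1"}, "cancel subscription": {"d9"}}
rate, failed = retrieval_failure_rate(labeled, fake_search, k=3)
```

The list of failing queries matters more than the rate itself: reading them is how you tell vocabulary mismatch apart from ordering problems, which in turn decides whether embeddings or a reranker is the next purchase.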

Private note

This is the part of the article that students should walk away with. Not the BM25 formula, not the cost table, not the Anthropic numbers. The procedural prescription: start simple, deploy, measure, add complexity only when you can name what it solves. Building elaborate RAG stacks before measurement is the most common form of premature optimization in this corner of the field. The teams that ship are the teams that ship the forty lines and then add what they measure to be necessary.

. . .

The Bottom Line

Most RAG tutorials open with a vector database. They are not wrong about what is technically possible, but they are starting from the wrong default. The default starting point should be BM25 plus LLM. Everything else is an addition you make when you measure that BM25 fails on your data.

What you give up by starting with BM25: paraphrase coverage on the queries where it matters. What you give up by starting with a vector DB: simplicity, cost discipline, low latency, exact-match correctness, explainability, multilingual support without re-indexing, and an operational story your team understands.

The argument is empirical, not aesthetic. Anthropic's own Contextual Retrieval recipe uses BM25 as one component, Elastic's own RAG positioning lists "textual, vector, hybrid, or semantic search" as co-equal retrieval modes, and IBM Watson Discovery packages BM25 plus LLM as a managed service. The most sophisticated production RAG systems use BM25 alongside vectors, not instead of them.

The simplest RAG is BM25 plus LLM; the best is BM25 plus vectors plus reranker, layered on after measurement. The worst is the one you build before you have measured what your retrieval actually needs.

. . .

References

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). "Okapi at TREC-3." TREC 1994. Cited via Lucene BM25Similarity JavaDoc.
Apache Lucene project. BM25Similarity (Apache Lucene 8.11.0 API).
Elastic. "Similarity module." Elasticsearch Reference.
Connelly, S. (2018). "Practical BM25, Part 2: The BM25 Algorithm and its Variables." Elastic Blog.
Elastic. "Match query." Elasticsearch Reference.
Elastic. "Multi-match query." Elasticsearch Reference.
Elastic. "Boolean query." Elasticsearch Reference.
Elastic. "Paginate search results." Elasticsearch Reference.
Elastic. "Highlighting." Elasticsearch Reference.
Elastic. "elasticsearch-py Getting Started." Official Python client documentation.
turbopuffer engineering blog. (2025). "Why BM25 queries with more terms can be faster (and other scaling surprises)."
Query Quotient. (2025). "Elasticsearch Query Performance Optimization Guide 2025."
Anthropic. "Prompting best practices." Claude API Documentation.
Anthropic. "Retrieval Augmented Generation guide." Claude Cookbooks.
Anthropic. "Context windows." Claude API Documentation.
Anthropic. "Pricing." Claude API Documentation.
Anthropic. (2024). "Introducing Contextual Retrieval."
Artificial Analysis. "Claude Sonnet 4.6." Model latency benchmarks.
Elastic. "Reciprocal rank fusion." Elasticsearch Reference.
Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal rank fusion outperforms condorcet and individual rank learning methods." SIGIR 2009.
Haystack (deepset). "ElasticsearchBM25Retriever." Haystack Documentation v2.29.
LangChain. "ElasticsearchRetriever." LangChain Documentation.
Fox, P. (2024). "Doing RAG? Vector search is not enough." Microsoft Azure Developer Community Blog.
Sawarkar, K., Mangal, A., & Solanki, S. R. (2024). "Blended RAG: Improving RAG Accuracy with Semantic Search and Hybrid Query-Based Retrievers." arXiv:2404.07220.
Thinking Loop. (2025). "When to Ditch Your Vector DB for Simple BM25." Medium.
