
What Classic Search Does Before the LLM

Classic search is not pre-AI. It is the lexical retriever that still sits underneath most production RAG systems in 2026, in the same engine that now also runs dense vectors and sparse neural models. The query goes through an analyzer, hits an inverted index, gets scored by BM25, and a top-k list comes back. Before any of that turns into a synthesized answer, the engineering of the retrieval step decides what the answer can be.

In Brief

Classic search did not get replaced by vector search; it got joined by it. In most production RAG systems in 2026, the lexical retriever is still doing the load-bearing work: an analyzer normalizes the query into tokens, those tokens hit an inverted index built from the corpus, BM25 scores the candidate documents, and a top-k list comes back. The same Elasticsearch engine that now runs dense vectors and sparse neural models started life as a lexical search engine, and the lexical path is still what most queries take. Understanding how it works is the prerequisite for understanding when vector retrieval helps and when it gets in the way.

This article walks the four moving parts: the analyzer pipeline that decides what counts as a token, the inverted index that turns the corpus into a structure you can query, the BM25 formula with its k1 and b parameters that controls how scores are computed, and the top-k engineering that decides what makes it into the LLM's context. The practical heuristic the article closes on: if the queries your users send carry exact-match expectations (a code, a part number, a name), BM25 belongs in the stack and hybrid retrieval is the default. Pure-vector is the special case, not the starting point.

A drilling engineer types stuck pipe with wellbore instability into a search box and hits enter. Ten results come back, ranked, with snippets and source attributions. The engineer reads the top three, opens two of them, and starts piecing together an answer about reactive shale and mud weight windows. This is the workflow every technical search system has supported for two decades.

In 2026 the same engineer might also paste that question into a chat box and get a synthesized answer back. That experience feels like a different category of product. The retrieval step underneath it, however, is often unchanged. Top-k from a lexical search index feeds the LLM context, the LLM does the synthesis, and the user reads a paragraph instead of ten links. The retrieval is the same retrieval the engineer was doing in 2010.

This article is about that retrieval. The demo below plays itself, so you can watch what the user sees. The body of the article then walks the engineering that turned the query into the list.

The query types itself in, the skeleton loader resolves, and ten ranked results stagger down the page. No synthesis, no LLM, no vector search.

. . .

What You Just Watched

The demo runs end to end in about six seconds. A query (stuck pipe with wellbore instability) types into the search box. A short skeleton loader fills the result area. Ten cards drop in, top to bottom, each with a title, a source attribution (SPE, Schlumberger, PetroWiki, and so on), and a snippet with the matched terms bolded inline. Then a Replay button appears.

There is no chat box and no synthesis, and nothing that an LLM would call an "answer." The user gets a ranked list of documents and is expected to read them.

That last point matters and the rest of the article keeps coming back to it. The retrieval step did its job and the list is relevant, but the cognitive work of putting the answer together is still the user's. Classic search is a tool for finding documents, not for producing answers.

What the audience cannot see in the demo is the engineering between the search box and the list. The query did not travel intact to a database. It was tokenized, lower-cased, possibly stemmed, looked up against an inverted index, scored by BM25 against every candidate document, and then truncated to the top ten by that score. Each one of those steps is a decision that shaped what the engineer ended up reading. The next section opens that pipeline.

. . .

What Happens Between the Query Box and the List

The reference implementation for classic search in 2026 is Lucene, the underlying library, and Elasticsearch (or OpenSearch), the distributed engine wrapped around it. Almost every commercial search product in the technical-document space is one of these two with custom analyzers and a custom retrieval pipeline on top. The mechanism is worth describing precisely because it determines what the user can and cannot find.

The Analyzer

Before any document is indexed, its text passes through an analyzer. The analyzer is a three-stage pipeline: character filters, a tokenizer, and a sequence of token filters. The same pipeline runs again on every query before lookup, which is why the query and the documents end up speaking the same vocabulary.¹

Character filters strip or rewrite characters before tokenization. They handle things like HTML entity decoding, replacing curly quotes with straight ones, or normalizing Unicode forms so café indexed in NFC matches a query of café typed in NFD. They are optional, and most analyzers use zero of them.

The tokenizer is the only required stage. It splits the input into tokens. The default standard tokenizer follows the Unicode Text Segmentation specification, which gets the obvious word boundaries right and handles punctuation gracefully. For the query stuck pipe with wellbore instability it produces five tokens: stuck, pipe, with, wellbore, instability. For a sentence containing U.S.A. or SLB-2024-117 the choice of tokenizer matters a great deal, because the default will split on the punctuation that the domain treats as significant.

Token filters then operate on the token stream. Elasticsearch's standard analyzer lowercases everything but leaves stop words in place unless configured otherwise; language analyzers such as english additionally drop a small list of English stop words and apply stemming. After filtering with such an analyzer, Wellbore Instabilities in a document and wellbore instability in a query both reduce to the same indexed terms. This is what makes classic search robust to the surface variation that every real text corpus contains.

The analyzer is the part of the system the rest of the pipeline assumes. Get the analyzer wrong (use a stemmer that strips technical suffixes the domain depends on, or fail to handle compound technical identifiers) and the inverted index downstream will be populated with the wrong terms. The query will never recover from that.
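The three stages can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation: the regex tokenizer is a crude stand-in for Unicode Text Segmentation, the stop list is an assumption made up for the example, and stemming is left out.

```python
import re
import unicodedata

# Illustrative stop list (an assumption); real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "in"}


def char_filter(text: str) -> str:
    """Character filter: normalize to NFC and straighten curly quotes."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')


def tokenize(text: str) -> list[str]:
    """Tokenizer: split on runs of non-word characters (a simplification)."""
    return [tok for tok in re.split(r"\W+", text) if tok]


def token_filters(tokens: list[str]) -> list[str]:
    """Token filters: lowercase, then drop stop words. No stemming here."""
    lowered = [tok.lower() for tok in tokens]
    return [tok for tok in lowered if tok not in STOP_WORDS]


def analyze(text: str) -> list[str]:
    """Run the full pipeline; the same function serves documents and queries."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("Stuck pipe with Wellbore instability"))
# ['stuck', 'pipe', 'wellbore', 'instability']
print(analyze("SLB-2024-117"))
# ['slb', '2024', '117']
```

The second call shows the identifier problem in miniature: a tokenizer that splits on hyphens turns one part number into three unrelated tokens, which is why identifier-heavy corpora usually need a custom tokenizer or a separate keyword field.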

The transformation an analyzer performs is mechanical and step-by-step. Each yellow box below is a stage of the analyzer, and the bands between them show the data after that stage has run. The four green chips at the bottom are what ends up in the inverted index.

The Inverted Index

Once tokenized and filtered, every term is added to the inverted index. The data structure is the simplest part of the whole system to describe and the hardest to build at scale.²

For each unique term in the corpus, the inverted index stores a posting list: the set of document IDs the term appears in, along with positions, term frequencies, and any other per-occurrence data the engine needs for scoring. A query is processed by analyzing it into terms, looking up the posting list for each term, intersecting or unioning those lists depending on the query semantics, and producing a candidate document set to score.³

# Conceptual shape of the inverted index for a tiny corpus.
# Each term maps to (doc_id, term_frequency) pairs.

stuck       -> [(d1, 3), (d2, 2), (d3, 1), (d5, 4), ...]
pipe        -> [(d1, 2), (d2, 3), (d3, 1), (d9, 1), ...]
wellbore    -> [(d1, 4), (d3, 5), (d4, 2), (d8, 6), ...]
instability -> [(d1, 2), (d3, 4), (d8, 3), ...]
shale       -> [(d2, 6), (d5, 3), ...]
pack-off    -> [(d3, 2), (d4, 4), ...]

The posting lists are sorted by doc_id, so intersections and unions can be computed with a single merge pass, in time linear in the length of the lists. A query like stuck AND pipe AND wellbore AND instability finds documents in the intersection of the four posting lists. A query like the one the demo runs (which Elasticsearch interprets as an OR of the terms by default) finds documents in the union, with the scoring step deciding which union members rank above which others.⁴
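A toy version of the structure makes the lookup concrete. The sketch below assumes documents have already been analyzed into tokens; the three-document corpus is invented for illustration, and the in-memory dict stands in for Lucene's on-disk segment format.

```python
from collections import defaultdict


def build_index(docs: dict[int, list[str]]) -> dict[str, list[tuple[int, int]]]:
    """Map each term to a posting list of (doc_id, term_frequency), sorted by doc_id."""
    counts: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            counts[tok][doc_id] = counts[tok].get(doc_id, 0) + 1
    return {term: sorted(tf.items()) for term, tf in counts.items()}


def doc_ids(index: dict[str, list[tuple[int, int]]], term: str) -> list[int]:
    """Doc IDs for a term, in posting-list order; empty if the term is unknown."""
    return [doc_id for doc_id, _ in index.get(term, [])]


def intersect(a: list[int], b: list[int]) -> list[int]:
    """AND of two sorted doc_id lists in a single merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a: list[int], b: list[int]) -> list[int]:
    """OR of two doc_id lists (set-based for brevity; engines merge instead)."""
    return sorted(set(a) | set(b))


# Invented toy corpus: doc_id -> analyzed tokens.
docs = {
    1: ["stuck", "pipe", "wellbore", "instability", "stuck"],
    2: ["stuck", "pipe", "shale"],
    3: ["wellbore", "instability"],
}
index = build_index(docs)
print(index["stuck"])  # [(1, 2), (2, 1)]
print(intersect(doc_ids(index, "stuck"), doc_ids(index, "wellbore")))  # [1]
print(union(doc_ids(index, "stuck"), doc_ids(index, "wellbore")))  # [1, 2, 3]
```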

The interesting work is not the lookup. It is the scoring.

BM25 Scoring

BM25 has been Lucene's default similarity function since Lucene 6, which shipped in 2016, with Elasticsearch 5.0 adopting it the same year. Before BM25 the default was a tuned variant of TF-IDF using vector-space scoring. The switch was years overdue when it landed: BM25 had been the well-supported state of the art in the information retrieval literature since the late 1990s and produced visibly better rankings on most corpora out of the box.⁵

BM25 scores a document for a query by summing, across each query term, a function of three things: how often the term appears in the document (term frequency), how rare the term is in the corpus (inverse document frequency), and how long the document is relative to the corpus average (length normalization). The formula has two tunable parameters: k1 controls how quickly the term-frequency reward saturates, and b controls how aggressively long documents are penalized. Elasticsearch ships with k1 = 1.2 and b = 0.75, and most teams never change them.⁶⁷⁸
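For reference, the commonly cited form of the function is shown below. The notation is introduced here and is not from the article: f(t, d) is the frequency of term t in document d, |d| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, and n_t is the number of documents containing t. The IDF shown is the smoothed variant used by Lucene-style implementations.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left( 1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}} \right)}
\qquad
\mathrm{IDF}(t) = \ln\!\left( 1 + \frac{N - n_t + 0.5}{n_t + 0.5} \right)
```

Some implementations drop the constant (k_1 + 1) factor in the numerator. That rescales every score by the same amount and leaves the ranking unchanged.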

The shape of the function matters more than the exact algebra. Three properties are the ones to internalize:

Saturating term frequency. A document that contains a query term twice is worth more than one that contains it once. A document that contains it twenty times is worth only marginally more than one that contains it ten. The reward curve flattens. This is the fix BM25 makes over naive TF-IDF, which had pathologies where a document repeating the query term a hundred times would dominate the ranking purely by repetition.

Inverse document frequency. Rare terms are weighted more heavily than common terms. wellbore appears in a tiny fraction of an oil-and-gas corpus and gets a large weight. with appears in nearly every document and gets a near-zero weight (or is dropped entirely by stop-word filtering). The rarer query tokens, such as wellbore and instability, drive the ranking; with contributes almost nothing.

Length normalization. A 400-word page that mentions stuck pipe three times is worth more than a 40,000-word book that mentions it three times. The book's three mentions are diluted across more text. The page concentrates the relevance signal. The b parameter controls how strict this penalty is; at b = 0.75, a document twice the average length needs about 1.75x the term frequency to score the same as an average-length document, because the length factor 1 - b + b * 2 works out to 1.75.
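Saturation and length normalization are easy to check numerically. The sketch below isolates the per-term factor of BM25 with IDF left out; the function name and the length_ratio argument (document length divided by average length) are choices made for this example.

```python
def tf_component(tf: float, length_ratio: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Per-term BM25 factor without IDF. length_ratio = doc_length / avg_length."""
    norm = k1 * (1.0 - b + b * length_ratio)
    return tf * (k1 + 1.0) / (tf + norm)


# Saturation: each doubling of term frequency buys less than the last.
for tf in (1, 2, 10, 20):
    print(tf, round(tf_component(tf, length_ratio=1.0), 3))
# 1 1.0
# 2 1.375
# 10 1.964
# 20 2.075

# Length normalization: same tf, but a document twice the average length scores lower.
print(round(tf_component(3, length_ratio=1.0), 3))  # 1.571
print(round(tf_component(3, length_ratio=2.0), 3))  # 1.294
```

The factor can never exceed k1 + 1 (2.2 with the defaults), no matter how often the term repeats. That ceiling is the saturation property in one number.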

This is what classic search means in 2026. The mechanism is well-characterized, the parameters are stable, and the score it produces is a single number per (query, document) pair that the engine then sorts on.

Top-K Selection

Sorting a million BM25 scores to take the top ten is wasteful when most of those scores will not affect the top ten. Lucene maintains a min-heap of size k during scoring and only inserts a candidate into the heap when its score exceeds the current minimum. Combined with techniques like WAND and block-max WAND that skip documents whose theoretical maximum score cannot exceed the current heap floor, top-k retrieval on a corpus of hundreds of millions of documents finishes in single-digit milliseconds.⁹
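The bounded-heap idea can be sketched with Python's heapq module. This is a simplification that scores every candidate; it does not attempt the WAND-style skipping, which depends on per-term score upper bounds stored in the index. The document IDs and scores are invented.

```python
import heapq
from typing import Iterable


def top_k(scored_docs: Iterable[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    """Keep the k best (doc_id, score) pairs using a min-heap of size k."""
    heap: list[tuple[float, str]] = []  # heap[0] is the lowest score kept so far
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Candidate beats the current floor: evict the floor, insert the candidate.
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


scores = [("d1", 2.1), ("d2", 7.4), ("d3", 0.3), ("d4", 5.0), ("d5", 6.2)]
print(top_k(scores, 3))
# [('d2', 7.4), ('d5', 6.2), ('d4', 5.0)]
```

Each candidate costs at most O(log k) instead of participating in a full O(n log n) sort, and most candidates are rejected with a single comparison against the heap floor.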

The ten cards in the demo are the survivors of that process. Result ten beat result eleven by some BM25 margin, and the engine simply decided the user did not need to see result eleven or anything beyond it. That same decision is made for billions of queries a day across the engines descended from the same Lucene codebase, on math that has been stable for two decades and still works.

. . .

The Pipeline, End to End

The diagram below sketches the full path from the user's query to the ten cards in the demo. A parallel path runs during indexing, when each ingested document passes through the same analyzer and its tokens are written into the index.

The query path runs in milliseconds; the indexing path runs once per document and persists. Both paths share the analyzer stage, which is what guarantees they produce the same vocabulary, and that shared vocabulary is what lets BM25 score a query against a corpus the user has never seen.

. . .

Classic Search in 2026 Is Not Just Lexical

The pipeline described above is the lexical retriever. It is what Elasticsearch has always done. In the last three years it has been joined inside the same engine by two more retrievers, dense vector and sparse neural, plus a field type that automates both. They share the index, the query language, and the scoring infrastructure but operate on a different signal.¹⁴

Dense Vectors

A dense_vector field stores a fixed-length float array per document, produced by an embedding model at indexing time. At query time the engine embeds the query with the same model and runs an approximate nearest neighbor search, typically via the HNSW (Hierarchical Navigable Small World) graph index. Elasticsearch ships HNSW with default parameters m = 16 (neighbors per graph node) and ef_construction = 100 (candidates considered during graph construction). These defaults are the ones the Apache Lucene HNSW implementation uses and produce respectable recall on most corpora.¹⁰
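What HNSW approximates is exact nearest-neighbor search, which is simple to write down. The sketch below is a brute-force baseline in pure Python, not the graph algorithm; the three-dimensional vectors are made up for illustration, where a real embedding model would produce hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every document against the query and return the k most similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented toy embeddings, for illustration only.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.1],
    "d3": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
print([doc_id for doc_id, _ in exact_knn(query_vec, doc_vecs, k=2)])
# ['d1', 'd3']
```

The brute-force scan touches every vector, which is what makes it impractical at scale. HNSW trades a small amount of recall for visiting only a fraction of the graph's nodes.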

The query for stuck pipe with wellbore instability embedded as a dense vector and searched against an HNSW graph will return documents that are semantically similar in the embedding space. It will find documents that talk about pack-off and hole collapse without those exact tokens appearing in the query, because those concepts are close to the query's concepts in the vector space. It will also find documents that talk about completely different mechanisms (cementing problems, casing failures) when their language is structurally similar enough to confuse the embedding.

Vector search is good at paraphrase and concept matching. It is unreliable at exact identifier matching, because embedding models tend to dilute the specificity of tokens that look like noise to them. Searching for SPE-178843-MS against a vector index will rarely return the paper with that ID at the top.¹⁶

Sparse Neural: ELSER and sparse_vector

The middle path between lexical and dense vector is a sparse neural model. Elasticsearch ships ELSER (Elastic Learned Sparse EncodeR), a model trained to expand a piece of text into a weighted sparse vector over a vocabulary of about 30,000 terms. Each document and each query is expanded into hundreds of non-zero weights, but that is still a small share of the vocabulary, so the storage and lookup pattern remains sparse: a vector with 300 non-zero weights out of 30,000 dimensions is 99% zeros.¹¹

The benefit is that a sparse_vector field can be searched with the same inverted-index machinery that BM25 uses, just with learned per-term weights instead of TF-IDF-derived ones. The interpretation is closer to "BM25 with a smarter vocabulary" than to "vector search lite". Each non-zero dimension of an ELSER vector corresponds to a token, and the weights can be read by a human. This is the property dense vectors give up.
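Scoring with sparse vectors reduces to a dot product over the tokens two expansions share, which is why the weights stay readable. The sketch below uses hand-written weights as a stand-in for model output; they are not real ELSER values, and the function names are chosen for this example.

```python
def sparse_score(query_weights: dict[str, float], doc_weights: dict[str, float]) -> float:
    """Dot product over the tokens present in both sparse vectors."""
    return sum(w * doc_weights[tok] for tok, w in query_weights.items() if tok in doc_weights)


def explain(query_weights: dict[str, float], doc_weights: dict[str, float]) -> list[tuple[str, float]]:
    """Per-token contributions to the score, largest first."""
    parts = [(tok, w * doc_weights[tok]) for tok, w in query_weights.items() if tok in doc_weights]
    return sorted(parts, key=lambda pair: pair[1], reverse=True)


# Invented weights for illustration. "collapse" stands for a term the model
# might add during expansion even though the user never typed it.
query_weights = {"stuck": 1.5, "pipe": 1.2, "wellbore": 1.0, "collapse": 0.4}
doc_weights = {"pipe": 0.8, "collapse": 1.1, "shale": 0.9}

print(round(sparse_score(query_weights, doc_weights), 2))  # 1.4
print([tok for tok, _ in explain(query_weights, doc_weights)])  # ['pipe', 'collapse']
```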

ELSER catches paraphrase, like dense vector search does. It also retains the keyword-exactness that lexical search depends on, because the original tokens are still present in the expansion. For technical retrieval where both kinds of match matter, ELSER often outperforms either pure approach.

semantic_text: The Auto-Pipeline

The newest layer is semantic_text, a field type that handles embedding generation, chunking, and storage automatically. The defaults are aggressive: long documents are split into sections of about 250 words with overlap between neighboring sections (100 words under the word-based default; the chunking strategy is configurable and has changed across releases, so check the reference for your version), each section is embedded, and the resulting vectors are stored against the chunk along with the parent document ID. Hybrid search across semantic_text fields is a single query, with reciprocal rank fusion (RRF) blending the lexical and vector results into a unified ranking.
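Overlapping word windows are straightforward to sketch. The function below is an illustration of the windowing idea using the 250-word and 100-word figures quoted above as defaults; it is not how the engine implements chunking, and the function name is chosen for this example.

```python
def chunk_words(text: str, size: int = 250, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# A synthetic 600-word document: w0 w1 ... w599.
text = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(text)
print(len(chunks))  # 4
print(chunks[1].split()[0])  # w150
```

The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding and storing some text twice.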

The implication is that the engine that runs the demo at the top of this page can also, with the same query language and almost the same index, run a hybrid search that combines BM25 over the analyzed text with a dense-vector search over a learned embedding and an ELSER search over a learned sparse representation. The choice of which retrievers participate is a configuration decision, not a platform decision. Classic in 2026 is less a category of engine and more a choice within an engine.

Hybrid Retrieval via RRF

When two or three retrievers are running over the same corpus, their results have to be combined into a single ranked list. The naive approach (sum the scores) does not work because BM25 scores and cosine similarity scores live on different scales. Calibration is possible but brittle.¹²

Reciprocal rank fusion sidesteps the calibration problem entirely. For each retriever's list, each document's contribution is 1 / (k + rank), where rank is its 1-based position in that list and k is a smoothing constant (commonly 60). The final score for a document is the sum of these contributions across all retrievers; a document absent from a list contributes nothing for that retriever. The formula has one tunable parameter, and in practice it is rarely tuned.¹³
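The fusion step fits in a dozen lines. The sketch below follows the 1 / (k + rank) formula directly, with 1-based ranks; the two input rankings are invented, and the function name is chosen for this example.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs by summing 1 / (k + rank) per list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Invented rankings from two retrievers over the same corpus.
bm25_ranking = ["d1", "d3", "d2"]
dense_ranking = ["d3", "d4", "d1"]

fused = rrf([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])
# ['d3', 'd1', 'd4', 'd2']
```

d3 wins because it ranks near the top of both lists, ahead of d1, which is first in one list but third in the other. Only ranks enter the computation, so the raw BM25 and cosine scores never have to be put on a common scale.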

The benefit is not just that RRF works; it is that RRF is auditable. A document's fused score can be traced back to its rank in each underlying retriever, and a missing document can be diagnosed retriever by retriever. This is the kind of transparency that pure score addition does not give you and that production retrieval teams end up needing the day a user reports that the obviously-correct document is not showing up.

. . .

Why BM25 Still Wins Where It Wins

The natural question after a tour of dense vectors and sparse neural models is whether classic search still has a job to do. In 2026 the answer is straightforwardly yes. Four reasons keep BM25 in the production stack of nearly every retrieval system that takes itself seriously.

Strength	What it covers	Where vectors struggle
Exact identifier match	SKUs, part numbers, error codes, well names, ticket IDs	Embedding models treat identifiers as noise tokens and dilute them
Rare technical vocabulary	Domain jargon, acronyms, named methods, proper nouns	Out-of-distribution tokens get unreliable embeddings
Auditable ranking	Each document's score traces to TF, IDF, and length terms	Vector scores have no human-readable derivation
Latency and cost	Single-digit milliseconds on hundreds of millions of documents, no GPU	Vector retrieval needs an HNSW index in RAM and an inference budget

The point is not that vectors are bad. It is that the strengths of the two approaches are complementary, and that the production system that uses both does better than either alone. The pure-vector RAG architecture was the popular default for roughly a year, between 2022 and 2023, and the field has been quietly walking that back since.

To make the failure mode concrete: imagine a drilling engineer who read SPE-178843-MS yesterday, remembers the paper exists, and wants to find it today. That is not a paraphrase query; it is a literal-string query for a specific document the user already knows is in the corpus. A pure-vector retriever will hand back ten papers that are conceptually similar to the one they want, none of which is the one they want, and there is no signal in the result list that anything went wrong. The retrieval looked fine, but it was not the retrieval the user needed, and the user has no way to recover from that failure without leaving the search experience entirely.

The Heuristic for Week 5

Do exact-match expectations exist in your query distribution?

If yes, and in nearly every production workload they do, you need precision and explainability, with enough control over scoring to debug a wrong result. BM25 belongs in the retrieval stack and hybrid (with RRF as the default fusion) is the right architecture. The lexical layer is where exact identifiers, error codes, technical jargon, and known-document recall come from. None of that survives a pure-vector pipeline.

If no, and every query is a conceptual paraphrase that never has to land on a specific known document, pure-vector is defensible. That workload is real but rare, and it is not the default a production RAG system should reach for first.

This is the consensus, including in Elasticsearch's own documentation: hybrid retrieval with RRF is the default. Pure-vector is the special case, not the starting point. The cost of getting that wrong is not a metric drop on an evaluation dashboard; it is the engineer who knew the exact paper they wanted, could not find it through a vector retriever, and either gave up or routed around the system. Lexical retrieval is what prevents that outcome.

For the drilling engineer searching for stuck pipe with wellbore instability, both modes are relevant. The query has technical jargon (wellbore) that benefits from lexical exact match. It also has a conceptual relationship between the two phrases that a vector retriever can pick up where BM25 cannot. A hybrid system, in practice, ranks the top three documents the same way BM25 alone does and rescues documents at ranks 8 through 20 that BM25 alone would have missed.¹⁵¹⁷

. . .

What the Demo Abstracts

The walkthrough at the top of the page is honest about what it is showing and quiet about what it is hiding. The honesty is in the user-facing flow: query types in, list comes back, no synthesis. The quiet is everything in the middle, which is most of the engineering.

The analyzer is not visible. The user does not see that stuck pipe was split into the separate tokens stuck and pipe, that wellbore and Wellbore reduced to the same indexed term, or that instability may have been stemmed to instabl depending on which analyzer the corpus is using. These choices were made by whoever configured the index, and changing them would change what the demo returns.

The inverted index is not visible. The user does not see that result 1 (the SPE paper) and result 8 (the SPE Journal paper on geomechanical models) both come from the same publisher but are unrelated entries as far as the index is concerned. What they share is membership in the posting lists for stuck, pipe, wellbore, and instability, and that is what put them both into the candidate set.

The BM25 scores are not visible. The user sees a ranked list but not the score that put each document where it is, nor the score distance between adjacent results. In a real engineering interface, exposing the scores (or the rank gaps) helps debug "why is this one above that one" questions; the demo deliberately strips this so the audience focuses on the user-facing flow.

The top-k truncation is not visible. The list has ten cards while the underlying corpus has thousands of candidate documents, and results 11 through 1000 exist but the engine decided the user did not need to see them. In a production setting the choice of k is a substantial design decision: too small and relevant documents get cut; too large and the user (or, in a RAG setting, the LLM context budget) cannot consume them.

All of this is fine. The demo is teaching a mental model, not an implementation. Once the mental model is in place, the engineering details slot into it without rewriting the picture.

. . .

Where Classic Search Sits in the RAG Stack

The retrieval-augmented generation pattern, in its most common form, is: take the user's query, retrieve the top-k documents from a search index, place those documents in the LLM's context window, and ask the LLM to answer the question grounded in that context. The retrieval step is the search step, and it is the same retrieval step the demo just walked through.
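The hand-off from retriever to LLM is mostly string assembly. The sketch below shows only that step: the hit dictionaries with title, source, and snippet keys are a hypothetical shape, the placeholder hits are invented, and the LLM call itself is left out because it depends on the provider.

```python
def build_prompt(query: str, hits: list[dict], max_docs: int = 3) -> str:
    """Place the top retrieved documents into a grounded-answer prompt."""
    blocks = []
    for i, hit in enumerate(hits[:max_docs], start=1):
        blocks.append(f"[{i}] {hit['title']} ({hit['source']})\n{hit['snippet']}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered sources below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Placeholder hits standing in for the retriever's top-k output.
hits = [
    {"title": f"Placeholder title {n}", "source": f"Source {n}", "snippet": f"Snippet text {n}."}
    for n in range(1, 5)
]
prompt = build_prompt("stuck pipe with wellbore instability", hits)
print(prompt)
```

Nothing in the prompt records which retriever produced the hits. Swapping BM25 for a hybrid retriever changes the contents of the list, not the shape of this step.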

This is what is missed by the version of RAG that gets taught most often, the one where the retriever is always a vector store and the embedding model is always an OpenAI or Cohere production model. That version is one configuration among several. The retrieval step does not have to be vector search; the LLM does not know or care which retriever produced its context. A classic BM25 top-k feeds an LLM just fine. The companion demo (simple-rag-walkthrough) replays the classic-search demo and bolts an LLM synthesis layer on top, which is the smallest possible RAG architecture and works.¹⁸

The hybrid systems described above are the production default for one reason: the LLM's output quality is bounded by the relevance of the context it received. If the retriever did not surface the document that contains the answer, no amount of prompt engineering or model upgrades will recover it. The retriever decides what the LLM can possibly say. Spending the cost on a hybrid retriever that catches both exact-match and paraphrase cases pays back across every downstream application.

Classic search is the floor that RAG stands on. In 2026, it is also one of the three retrievers running inside the same engine that does the vector and sparse-neural work. This article is the first in a Week 5 sequence that walks the audience from this floor up through hybrid retrieval, reranking, and structured-data retrieval. The next step in that sequence is the LLM synthesis demo and its companion article, which add the generation layer on top of exactly the same ten results that the walkthrough at the top of this page returned.

That is the whole stack at its smallest: classic search returns the list, and the LLM reads the list and writes the paragraph. Everything else (vector retrievers, rerankers, query rewriting, multi-hop, structured-data routing) is an optimization on top of that two-step structure. The mental model the walkthrough teaches is the load-bearing one.

. . .

References

Primary-source grounding, chapter-level citations, and annotation for each numbered reference live on the companion sources page.

1. Codecurated. (Current). "Introduction to Analyzer in Elasticsearch." Character filter, tokenizer, and token filter walkthrough.
2. Elastic. (2014). "Elasticsearch from the Bottom Up, Part 1." Foundational walkthrough of Lucene's inverted index, term dictionary, and posting list architecture.
3. Blaszyk, J. (2023). "Exploring Apache Lucene, Part 1: The Index." The on-disk structures Lucene maintains per segment: inverted index, DocValues, stored fields, and write-once segments.
4. mocobeta. (Current). "Lucene postings format." At-a-glance overview diagrams of the default Lucene posting list binary format, including the 128-document block structure.
5. Turnbull, D. (2015). "BM25: The Next Generation of Lucene Relevance." OpenSource Connections. Context on why BM25 replaced the TF-IDF default in Lucene 6.
6. Apache Lucene. (Current). "BM25Similarity (Lucene API)." Reference implementation of BM25 as Lucene's default similarity function.
7. Elastic. (Current). "Practical BM25, Part 2: The BM25 Algorithm and its Variables." The k1, b, and length-normalization parameters explained for practitioners.
8. Elastic. (Current). "Practical BM25, Part 3: Considerations for Picking b and k1 in Elasticsearch." Why the defaults k1 = 1.2 and b = 0.75 hold up across most corpora.
9. Prithv. (Current). "Inverted Index Explained: How Elasticsearch Achieves Sub-Millisecond Search on Billions of Documents." dev.to walkthrough of skip lists, WAND, and block-max optimization.
10. Elastic. (Current). "Semantic text field type." Elasticsearch Reference. Auto-pipeline for chunking, embedding, and hybrid retrieval, including HNSW defaults.
11. Elastic. (Current). "ELSER: Elastic Learned Sparse Encoder." Sparse neural model, vocabulary, and sparse_vector field type.
12. Elastic. (Current). "What is hybrid search? How it works and when to use it." Native lexical-plus-vector retrieval with RRF as the default fusion.
13. Elastic. (Current). "Elasticsearch hybrid search: Overview and hybrid search queries." Elasticsearch Labs. Production-oriented description of the hybrid retriever syntax.
14. Elastic. (Current). "Lexical and semantic search with Elasticsearch." Side-by-side comparison of the three retrievers inside one engine.
15. Redis. (2024). "Full-text search for RAG apps: BM25 and hybrid search." Argument for BM25 as the precision layer in production RAG pipelines.
16. Trim, C. (2026). "The Retrieval Quality Problem." Precision and recall for retrieval, RRF, and stratified evaluation. The companion piece on measuring what the retrieval step produces.
17. Trim, C. (2026). "Embedding Models for RAG: Selection, Evaluation, and Fine-Tuning." The dense-vector retriever in detail. Where this article handles the lexical floor, that one handles the embedding layer.
18. COSC-650 Repository. (2026). "Demo series: classic search vs RAG synthesis." Demo specification, data, and animation timing for the walkthrough embedded above.
