
Sources

Grounding, citations, and verification notes for "What Classic Search Does Before the LLM."

All of this is optional. The article is self-contained. This page exists so that the claims about Lucene, BM25, and Elasticsearch's modern hybrid stack are properly cited and so that anyone who wants to verify a specific claim knows where to look.

Each numbered entry below links back to the exact point in the article where its claim appears, and forward to the primary source. This page exists in part because a reader asked, fairly, whether "the engine still uses an inverted index" really describes how Elasticsearch works in 2026. Sources 2 through 4 answer that question.

About the Sources

Elastic Blog and Elasticsearch Reference

Primary documentation from the company that builds Elasticsearch.

Most of the architectural claims about Lucene's index structures, BM25 parameters, ELSER, semantic_text, and hybrid retrieval cite Elastic's own engineering blog or the Elasticsearch Reference. These are the closest thing the field has to authoritative documentation for the engine, written by the engineers who maintain it.

Apache Lucene API and Source

The library underneath Elasticsearch, OpenSearch, and Solr.

BM25 is implemented in Lucene's BM25Similarity class, and the inverted index file format is described in Lucene's codec source. When a claim in the article concerns the algorithm itself rather than how Elasticsearch surfaces it, the citation points at Lucene rather than Elastic.

Practitioner Walkthroughs

Independent engineering blogs by people who have used Lucene in anger.

Blaszyk (j.blaszyk.me), mocobeta (GitHub), Codecurated, and Prithv (dev.to) walk through Lucene's data structures and Elasticsearch's analyzer pipeline at a level of detail Elastic's marketing-adjacent material rarely matches. These are secondary sources, but they are the secondary sources practitioners actually read.

Companion Articles in the Same Course

Trim (2026). Same Week 5 sequence for COSC-650.

Two of the references are sibling articles in the same week of the course: one on retrieval-quality measurement, one on embedding-model selection. Neither covers classic search directly, but both build on the lexical floor this article describes.

What Happens Between the Query Box and the List

1. The analyzer is char filters, then a tokenizer, then token filters ↩ back to article

Codecurated's walkthrough lays out the analyzer's three-stage pipeline cleanly: character filters run first (optional, often zero), exactly one tokenizer runs second (required), and any number of token filters run last. The same analyzer runs on both the document at index time and the query at lookup time, which is why a document containing Wellbore and a query containing wellbore end up matching: both reduce to the same indexed term after the lowercase filter runs.

Codecurated. "Introduction to Analyzer in Elasticsearch." Read
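
The three-stage order described in this entry can be sketched in a few lines of Python. This is an illustrative model only, not Elasticsearch's implementation: `strip_html`, `word_tokenizer`, `lowercase`, and `analyze` are hypothetical names chosen for the example.

```python
import re
from typing import Callable

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], list[str]]
TokenFilter = Callable[[list[str]], list[str]]


def strip_html(text: str) -> str:
    """Hypothetical char filter: replace anything shaped like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> list[str]:
    """Hypothetical tokenizer: split on runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: list[str]) -> list[str]:
    """Hypothetical token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    char_filters: list[CharFilter],
    tokenizer: Tokenizer,
    token_filters: list[TokenFilter],
) -> list[str]:
    # Stage 1: zero or more character filters rewrite the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # Stage 2: exactly one tokenizer turns the string into tokens.
    tokens = tokenizer(text)
    # Stage 3: zero or more token filters rewrite the token stream.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# The same analyzer is applied to the document and to the query.
doc_terms = analyze("<b>Wellbore</b> pressure", [strip_html], word_tokenizer, [lowercase])
query_terms = analyze("wellbore", [strip_html], word_tokenizer, [lowercase])
print(doc_terms, query_terms)
```

Because both sides pass through the same function, the document term and the query term are identical strings by the time they reach the index, which is the matching behavior the entry describes.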

2. Elasticsearch still uses Lucene's inverted index as the lexical retrieval structure ↩ back to article

The Elastic engineering blog states the architecture directly: "the inverted index maps terms to documents (and possibly positions in the documents) containing the term." The structure has two components, a sorted term dictionary and a posting list per term, and queries proceed by "looking up all the terms and their occurrences, and tak[ing] the intersection (for AND searches) or the union (for OR searches)." This is the article's central claim about Elasticsearch's lexical path, and the Elastic blog confirms it as the architecture the engine continues to use in 2026 alongside dense_vector and ELSER, not in place of them.

Elastic. "Elasticsearch from the Bottom Up, Part 1." Read
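
A toy version of the two components named in this entry (a sorted term dictionary and one posting list per term) makes the AND/OR behavior concrete. This is a sketch under simplifying assumptions: whitespace tokenization, in-memory dictionaries, and hypothetical function names. Lucene's on-disk structures are far more compact.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted list of the document IDs that contain it."""
    postings: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorting the items gives a term dictionary in lexicographic order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search_and(index: dict[str, list[int]], terms: list[str]) -> list[int]:
    """AND search: intersect the posting lists of all terms."""
    if not terms:
        return []
    sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*sets))


def search_or(index: dict[str, list[int]], terms: list[str]) -> list[int]:
    """OR search: take the union of the posting lists of all terms."""
    return sorted(set().union(*(index.get(term, []) for term in terms)))


docs = {1: "wellbore pressure log", 2: "pressure gauge", 3: "wellbore casing"}
index = build_index(docs)
print(search_and(index, ["wellbore", "pressure"]))  # [1]
print(search_or(index, ["wellbore", "pressure"]))   # [1, 2, 3]
```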

3. The inverted index is one of several structures Lucene maintains per segment ↩ back to article

Blaszyk's 2023 walkthrough enumerates the on-disk structures Lucene writes per segment: the inverted index (term dictionary plus postings), DocValues (column-oriented per-document storage for sort, facet, and aggregation), stored fields (row-oriented retrieval of original field values), and write-once segments that periodically merge. The article's body focuses on the inverted index because that is the structure BM25 reads, but the engine also writes DocValues, stored fields, HNSW graphs for dense_vector fields, and BKD trees for numerics in the same indexing pass.

Blaszyk, J. (2023). "Exploring Apache Lucene, Part 1: The Index." Read

4. Posting lists are sorted by doc_id and stored in 128-doc blocks ↩ back to article

The mocobeta repository documents Lucene's default postings format with binary-layout diagrams. Postings within a list are sorted by document ID and grouped into blocks of 128 documents, with per-block metadata (including the max impact score introduced in Lucene 8) that supports block-skipping during top-k retrieval. Because both lists are sorted, an intersection or union of two posting lists completes in a single merge pass. The article describes that pass as "linear-time," but in practice it is sublinear once block-max skipping kicks in.

mocobeta. "Lucene postings format diagrams." Read
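
The single merge pass and the benefit of blocking can be sketched as follows. This is an assumption-laden illustration, not Lucene's codec: the block-skipping variant here skips on each block's last document ID only, whereas the max-impact metadata mentioned above skips on scores. All function names are hypothetical.

```python
from bisect import bisect_left

BLOCK_SIZE = 128  # block size described in the entry above


def to_blocks(postings: list[int], size: int = BLOCK_SIZE) -> list[list[int]]:
    """Group a sorted posting list into fixed-size blocks."""
    return [postings[i:i + size] for i in range(0, len(postings), size)]


def intersect(a: list[int], b: list[int]) -> list[int]:
    """Two-pointer merge: one pass over two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_with_block_skip(short: list[int], long_blocks: list[list[int]]) -> list[int]:
    """Skip whole blocks whose last doc ID is below the current candidate."""
    out = []
    b = 0
    for doc_id in short:
        # Advance past blocks that cannot contain doc_id without reading them.
        while b < len(long_blocks) and long_blocks[b][-1] < doc_id:
            b += 1
        if b == len(long_blocks):
            break
        block = long_blocks[b]
        k = bisect_left(block, doc_id)  # search only inside the candidate block
        if k < len(block) and block[k] == doc_id:
            out.append(doc_id)
    return out


multiples_of_3 = list(range(0, 1000, 3))
print(intersect_with_block_skip([15, 300, 990, 991], to_blocks(multiples_of_3)))  # [15, 300, 990]
```

When one list is much shorter than the other, the skipping variant touches only the blocks that could contain a match, which is the intuition behind the "sublinear in practice" remark.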

5. BM25 became Lucene's default in Lucene 6 (2016) ↩ back to article

Turnbull's 2015 OpenSource Connections post anticipated and explained the switch from Lucene's tuned-TF-IDF default to BM25, which landed in Lucene 6 in 2016 and was picked up by Elasticsearch 5.0 the same year. The post is useful as a primary source on the rationale: BM25's saturating term-frequency and explicit length normalization fix the pathologies that the older vector-space scoring exhibited on long or term-heavy documents.

Turnbull, D. (2015). "BM25: The Next Generation of Lucene Relevance." Read

6. The reference implementation: Lucene's BM25Similarity class ↩ back to article

The Lucene API documentation specifies the exact formula used during scoring: for each query term, an IDF factor multiplied by a saturating TF factor parameterized by k1, with document-length normalization parameterized by b, summed over the query's terms. The class-level Javadoc cites Robertson and Zaragoza's "The Probabilistic Relevance Framework: BM25 and Beyond" as the algorithmic source. Anything an article says about BM25's mechanics should reduce to what this class does at scoring time.

Apache Lucene. "BM25Similarity (Lucene API)." Read
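
For orientation, the commonly stated textbook form of BM25 is written out below. Treat it as a sketch rather than a transcription of the class: the symbols f(t, d) (term frequency), |d| (document length), avgdl (average document length), N (document count), and n_t (documents containing t) are notation introduced here, and the exact constants in a given Lucene release should be checked against the Javadoc.

```latex
\[
\mathrm{score}(q, d) \;=\; \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left( 1 - b + b \, \dfrac{|d|}{\mathrm{avgdl}} \right)}
\]
\[
\mathrm{IDF}(t) \;=\; \ln\!\left( 1 + \frac{N - n_t + 0.5}{n_t + 0.5} \right)
\]
```

The factor (k_1 + 1) in the numerator is a constant for a fixed k_1, so an implementation that omits it produces the same ranking with differently scaled scores.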

7. What k1 and b actually do ↩ back to article

Elastic's "Practical BM25" Part 2 walks through the algorithm at a practitioner level: k1 controls how quickly the term-frequency reward saturates (higher = slower saturation, repeated terms keep adding signal), and b controls the strength of the length-normalization penalty (0 = ignore length entirely, 1 = full normalization). The article uses this characterization directly in the three-properties summary of BM25's scoring shape.

Elastic. "Practical BM25, Part 2: The BM25 Algorithm and its Variables." Read
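
A small numeric sketch shows both behaviors. The function below implements the textbook per-term BM25 score, not Lucene's code path; `bm25_term_score` and its parameter names are hypothetical, and the corpus statistics in the example are made up for illustration.

```python
import math


def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 contribution of a single query term to one document."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b scales how strongly document length inflates the denominator.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


stats = dict(avg_doc_len=100, n_docs=1000, doc_freq=50)

# Saturation: each extra occurrence adds less than the previous one.
for tf in (1, 2, 4, 8, 16):
    print(tf, round(bm25_term_score(tf, doc_len=100, **stats), 3))

# Length normalization: with b=0 length is ignored, with b=0.75 long documents score lower.
short_doc = bm25_term_score(2, doc_len=50, **stats)
long_doc = bm25_term_score(2, doc_len=400, **stats)
print(short_doc > long_doc)
```

Raising k1 in the call pushes the saturation point further out, and setting b to 0 makes the short and long documents score identically.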

8. Why k1 = 1.2 and b = 0.75 are the defaults ↩ back to article

Part 3 of the same Elastic series justifies the specific default values. The numbers trace back to Robertson and Sparck Jones's original Okapi BM25 work and have held up across decades of empirical evaluation on diverse text corpora. Elastic recommends leaving them alone unless there is a measured reason to tune them, which matches the article's claim that "most teams never change them."

Elastic. "Practical BM25, Part 3: Considerations for Picking b and k1 in Elasticsearch." Read

9. WAND, block-max WAND, and sub-millisecond top-k ↩ back to article

Prithv's dev.to walkthrough explains the family of skip-list and block-max optimizations that make top-k retrieval on a billion-document corpus tractable. The core idea: maintain a min-heap of size k while scoring, and use per-block max-impact metadata to skip entire blocks of postings that cannot exceed the current heap floor. The article cites this for the "single-digit milliseconds" claim about modern lexical retrieval.

Prithv. "Inverted Index Explained: How Elasticsearch Achieves Sub-Millisecond Search on Billions of Documents." Read
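
The core idea (a size-k min-heap plus per-block upper bounds) can be sketched for the simplest case of a single posting list with precomputed scores. Real WAND and block-max WAND operate over several query terms at once, so this is a deliberately reduced illustration; the function name and the input layout are assumptions made for the example.

```python
import heapq

Posting = tuple[int, float]          # (doc_id, score)
Block = tuple[float, list[Posting]]  # (max score in block, postings)


def top_k_block_max(blocks: list[Block], k: int) -> tuple[list[tuple[float, int]], int]:
    """Return the top-k (score, doc_id) pairs and the number of blocks skipped."""
    heap: list[tuple[float, int]] = []  # min-heap; heap[0] is the current floor
    skipped = 0
    for block_max, postings in blocks:
        # If the heap is full and no posting in this block can beat the floor,
        # skip the block without scoring any of its documents.
        if len(heap) == k and block_max <= heap[0][0]:
            skipped += 1
            continue
        for doc_id, score in postings:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), skipped


blocks = [
    (9.0, [(1, 9.0), (2, 3.0)]),
    (2.0, [(3, 2.0), (4, 1.0)]),  # cannot beat the floor of 3.0, so it is skipped
    (8.0, [(5, 8.0), (6, 0.5)]),
]
print(top_k_block_max(blocks, k=2))  # ([(9.0, 1), (8.0, 5)], 1)
```

The heap floor rises as better documents arrive, so later blocks are increasingly likely to be skipped, which is where the savings on large corpora come from.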

Classic Search in 2026 Is Not Just Lexical

10. HNSW defaults and the semantic_text auto-pipeline ↩ back to article

The Elasticsearch Reference page for semantic_text documents the defaults the article cites verbatim: HNSW with m = 16 and ef_construction = 100, 250-word chunks with 100-word overlap for long-document chunking, and reciprocal rank fusion as the recommended default for combining lexical and vector results. The "semantic_text simplifies semantic search by providing sensible defaults" framing is the Elastic Reference's own.

Elastic. "Semantic text field type." Read
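
To make the chunking numbers concrete, here is a minimal sliding-window chunker using the 250-word size and 100-word overlap quoted above. It illustrates the arithmetic only: whether Elasticsearch splits on whitespace or handles the final window this way is an assumption, and `chunk_words` is a hypothetical helper, not an Elasticsearch API.

```python
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 100) -> list[str]:
    """Split text into word windows of chunk_size that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each window starts 150 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(document)
print(len(chunks))                    # 4
print(chunks[1].split()[0])           # w150
print(len(chunks[-1].split()))        # 150
```

With these settings, a 600-word document yields windows starting at words 0, 150, 300, and 450, so any passage of up to 100 words appears whole in at least one chunk.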

11. ELSER vocabulary, sparsity, and inverted-index reuse ↩ back to article

Elastic's ELSER documentation specifies the model's parameters: a learned sparse encoder over a vocabulary of approximately 30,000 terms, with each document and query expanded into a weighted sparse vector where more than 99.9% of values are zero. Because the vocabulary is fixed and the representation is sparse, ELSER vectors are stored in a sparse_vector field and retrieved with the same inverted-index machinery that BM25 uses, just with learned per-term weights replacing TF-IDF-derived ones.

Elastic. "ELSER: Elastic Learned Sparse Encoder." Read
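
The reuse of inverted-index machinery can be sketched as a dot product computed term by term over posting lists that carry weights. The term weights below are invented for illustration and are not ELSER output; the function names are hypothetical.

```python
from collections import defaultdict

SparseVector = dict[str, float]  # term -> learned weight; absent terms are zero


def build_sparse_index(doc_vectors: dict[int, SparseVector]) -> dict[str, list[tuple[int, float]]]:
    """Store each nonzero weight in the posting list of its term."""
    index: dict[str, list[tuple[int, float]]] = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def score(index: dict[str, list[tuple[int, float]]], query: SparseVector) -> list[tuple[int, float]]:
    """Dot product of the query with every document, visiting only shared terms."""
    scores: dict[int, float] = defaultdict(float)
    for term, query_weight in query.items():
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up weights standing in for a learned sparse expansion.
docs = {1: {"drill": 1.5, "well": 0.5}, 2: {"well": 2.0, "pump": 1.0}}
index = build_sparse_index(docs)
print(score(index, {"well": 1.0, "drill": 2.0}))  # [(1, 3.5), (2, 2.0)]
```

Only terms present in both the query and a document contribute, so the cost tracks the number of nonzero entries rather than the size of the vocabulary.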

12. Hybrid search combines lexical and vector retrieval in one engine ↩ back to article

Elastic's "What is hybrid search" page frames hybrid retrieval as the production default for systems that care about both exact-match precision and paraphrase recall. The page also surfaces the scale-mismatch problem (BM25 scores and cosine similarity scores live on different scales, so naive score-summing does not work) and points at reciprocal rank fusion as the standard fix.

Elastic. "What is hybrid search? How it works and when to use it." Read

13. Reciprocal rank fusion: parameter-light, calibration-free ↩ back to article

The Elasticsearch Labs blog on hybrid search confirms the RRF formula used in production: each document's contribution from a retriever's list is 1 / (k + rank) with a smoothing constant k commonly set to 60. The constant is a single knob with little practical sensitivity, which is why Elasticsearch makes RRF the default fusion method when multiple retrievers participate in a single query.

Elastic. "Elasticsearch hybrid search: Overview and hybrid search queries." Read
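
The formula is short enough to write out directly. The sketch below fuses ranked lists of document IDs using 1 / (k + rank) with k = 60; `rrf` is a hypothetical helper, ranks are assumed to start at 1, and ties are broken by document ID purely to keep the output deterministic.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters; the retriever's raw score is never used.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical = ["a", "b", "c"]  # e.g. ordered by BM25
dense = ["b", "c", "a"]    # e.g. ordered by vector similarity
print(rrf([lexical, dense]))  # ['b', 'a', 'c']
```

Because the inputs are ranks rather than scores, the two retrievers never need to be calibrated against each other, which is the "calibration-free" property in the entry's title.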

14. Three retrievers sharing one engine ↩ back to article

The "Lexical and semantic search with Elasticsearch" post is the cleanest side-by-side comparison Elastic publishes. It walks the same query through lexical (BM25 over an inverted index), dense (HNSW over a dense_vector field), and sparse-neural (ELSER over a sparse_vector field) retrievers, and shows that each is strong on a different kind of query and weak where the others are strong. This grounds the article's claim that the three retrievers are complementary, not substitutes.

Elastic. "Lexical and semantic search with Elasticsearch." Read

Why BM25 Still Wins Where It Wins

15. BM25 as the precision layer for production RAG ↩ back to article

The Redis engineering blog argues, with examples, that "a RAG pipeline without BM25 in the retrieval stack will have systematic blind spots." Exact-match cases (SKUs, error codes, CLI flags, ticket IDs, the kinds of technical identifiers that an embedding model dilutes into noise) are where BM25 catches what vector retrieval misses. The article uses this framing in the "Why BM25 Still Wins" table, which mirrors the strength-categories Redis identifies.

Redis. (2024). "Full-text search for RAG apps: BM25 and hybrid search." Read

16. The dense-vector retriever, in detail ↩ back to article

The companion article in this Week 5 sequence handles the embedding-model layer directly: how to choose a model, how dimension affects retrieval, when to fine-tune, and how the MTEB benchmark family translates (and fails to translate) to a specific domain. Where the article you are reading covers the lexical floor, "Embedding Models for RAG" covers the layer that sits next to it in a modern hybrid stack.

Trim, C. (2026). "Embedding Models for RAG: Selection, Evaluation, and Fine-Tuning." Read

17. How to measure what the retrieval step produces ↩ back to article

The article you are reading describes the retrieval mechanism. The companion "Retrieval Quality Problem" article describes how to evaluate it: precision and recall for retrieval (not just for the LLM), reciprocal rank fusion as a debuggable fusion method, and the case for stratifying metrics by document type and query category. Together the two pieces cover the mechanism and the measurement.

Trim, C. (2026). "The Retrieval Quality Problem." Read

Where Classic Search Sits in the RAG Stack

18. Demo specification: classic search vs RAG synthesis ↩ back to article

GitHub Issue #118 in the COSC-650 repository specifies the demo embedded at the top of the article (Demo 1, classic-search-walkthrough) and the companion that adds an LLM synthesis layer (Demo 2, simple-rag-walkthrough). The issue contains the full result list, the synthesized answer used in Demo 2, the animation pacing decisions, and the rationale for using the same query in both demos so the audience can see exactly what the LLM layer adds. The pre-canned data is intentional: the point of the demos is the flow, not the backend.

COSC-650 Repository. (2026). "Demo series: classic search vs RAG synthesis." Read