
Fixing the Query: LLM-Driven Transformation over BM25

Most user queries are poorly matched to the index they are searching. The usual fix is to swap BM25 for a dense vector retriever. The alternative: keep BM25 on the corpus side and put an LLM on the query side. This article surveys the six query-transformation patterns that approach produces, the fusion layer that ties them together, and the cost ledger that decides when each one earns its place.

The companion articles in Week 5 already cover the two ends of this pipeline. The article on classic search walked the lexical retriever itself: how an analyzer turns text into tokens, how an inverted index stores those tokens, how BM25 scores documents against a query. The article on measuring retrieval walked the evaluation discipline that tells you whether your retriever is working, and gestured at this article in its section on "When Measurement Reveals a Problem." The piece you are reading now is the deep treatment of that gesture: what to do when the measurement loop tells you the retriever is fine but the queries are not.

The framing matters because it determines what you optimize. If you decide the retriever is the problem, you reach for a model swap, a hybrid architecture, a reranker. If you decide the query is the problem, you reach for the patterns in this article. The two decisions are not mutually exclusive, and a mature retrieval system in 2026 typically does both. But the query-side intervention is cheaper, faster to ship, and respects the closed-loop discipline that BM25 evaluation gives you: you change one component, measure the delta on your judged eval set, keep the variant that wins. No new index, no new vendor, no new infrastructure. Just an LLM call in front of the existing retriever.

. . .

The Shape of the Problem

Every BM25 miss is one of two shapes. The first shape is the under-specified query: a two- or three-word phrase that names a topic but does not carry enough lexical signal for BM25 to discriminate among thousands of candidate documents. The second shape is the over-specified query: a long natural-language sentence whose surface form does not match how the indexed documents are actually phrased. Both shapes are query problems, not retriever problems, and both are the failure mode this article is about.

The under-specified case shows up constantly in practitioner accounts. Users are not great at writing what they want into search systems: typos, vague queries, and limited vocabulary are the rule, not the exception.¹ A query of "memory issues" in a clinical knowledge base could mean working memory, semantic memory, age-related decline, medication side effects, or any of a dozen other concepts; BM25 has no way to disambiguate among them because the query offers no other terms to score against. The retrieval ends up dominated by whichever documents happen to contain the literal phrase "memory issues" the most times, and those are rarely the most useful documents.

The over-specified case has a different texture. Consider this query:

What is the degradation rate of lithium iron phosphate batteries at 45 degrees Celsius after 1000 cycles?

It contains plenty of lexical signal, but in the form of a question. The corpus, presumably a collection of battery research papers, contains the same vocabulary in the form of statements:

LFP cells exhibit capacity fade of...

at elevated temperatures (45-60 C)...

The question-form of the query and the statement-form of the documents do not share enough overlapping tokens for BM25 to find the strongest matches, even though a human reading both would recognize them as obviously related. The mismatch the RAG survey calls "the gap between the input text and the needed knowledge in retrieval" is exactly this.²
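A quick way to see the gap is to count the tokens the two forms share. The sketch below is illustrative only: the stopword list is a tiny hand-picked set, there is no stemming, and the passage is a made-up sentence in the style of the corpus fragments above, not a quotation from any paper.

```python
import re

# Tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"what", "is", "the", "of", "at", "after", "a", "an", "in", "to"}


def tokens(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords. No stemming."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


query = (
    "What is the degradation rate of lithium iron phosphate batteries "
    "at 45 degrees Celsius after 1000 cycles?"
)
# Hypothetical corpus sentence written for this example.
passage = (
    "LFP cells exhibit capacity fade at elevated temperatures (45-60 C) "
    "over extended cycling."
)

shared = tokens(query) & tokens(passage)
print("query terms:  ", len(tokens(query)))
print("shared terms: ", sorted(shared))
```

With this toy analyzer the only content token the two texts share is the number 45. "Lithium iron phosphate" never meets "LFP", "degradation" never meets "capacity fade", and "cycles" never meets "cycling", which is the mismatch BM25 has to score through.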

The standard response to either failure shape is to reach for dense vectors. Train a model that maps both questions and statements into a shared embedding space, search by cosine similarity instead of TF-IDF, problem solved. The maneuver works often enough that it has become the default in most RAG tutorials, but it carries costs that are easy to underestimate: you have to commit to an embedding model, re-embed the entire corpus every time the model changes, host a vector database, manage version drift, and accept that the embedding space is opaque in a way BM25's score components are not. The measuring-retrieval article walked the consequences of those costs in detail.

The query-transformation family asks a different question. What if the lexical mismatch is fixable on the query side, by an LLM that already knows how to translate between question-form and statement-form, between sparse keywords and rich descriptions, between specific instances and general principles? The corpus index stays pure-lexical, the retriever stays BM25, and the LLM intervenes only at the moment the user's text needs to be reshaped into something the index can score well. This is the framing the RAG survey calls "pre-retrieval optimization," and it is one of the steps that separates the "Naive RAG" approach of feeding the raw query to the retriever from the "Advanced RAG" approach of adding pre- and post-retrieval stages.²

The technical decomposition the field has converged on, captured most cleanly in the INTERS framework from ACL 2024, treats LLM-applied-to-IR as three distinct sub-tasks:³

Query understanding. Reshape the user's text into something the retriever can score well against.
Document understanding. Augment, summarize, or rewrite indexed documents so the retriever has better targets to match against.
Query-document relationship understanding. Score, rerank, or filter retrieved pairs with an LLM that reasons over both sides at once.

Each is a place an LLM can be specialized to help the retrieval pipeline. The query-transformation patterns in this article all live in the first sub-task, and they are the cheapest of the three to deploy because they do not require any change to the index. The corpus side stays untouched.

. . .

The Query-Transformation Family at a Glance

Six patterns dominate the literature and the production frameworks. Each takes a different angle on the same problem: how do you reshape the user's text so a lexical retriever can find what they meant? The table below is the map the rest of the article fills in.

Pattern	What the LLM does	Best for	Cost shape
Multi-query	Generate N alternative phrasings of the original query.	Under-specified queries with vocabulary ambiguity.	N parallel retrievals; fused by RRF.
HyDE	Hallucinate a plausible answer document; search with that.	Question-to-statement mismatch; cold-start corpora.	1 extra LLM call per query.
Step-back	Abstract the query to a higher-level concept; search with both.	Over-narrow technical queries needing context.	1 extra LLM call; 2 retrievals.
Query2doc	Generate a pseudo-document; concatenate with original query.	Sparse retrievers; vocabulary expansion.	1 extra LLM call; same number of retrievals.
Decomposition	Split a complex query into atomic sub-questions.	Multi-hop questions spanning several documents.	One retrieval per sub-question; sometimes sequential.
Rewrite-Retrieve-Read	A trained rewriter optimizes the query against retrieval reward.	Production systems with retrieval feedback loops.	One forward pass through a small fine-tuned LM.

These patterns are not exclusive of one another. A production system can run multi-query and step-back together, or chain decomposition with HyDE on each sub-query. The fusion layer that makes those combinations practical is Reciprocal Rank Fusion, which the article gets to in its own section. The patterns also differ in maturity: HyDE, Query2doc, and step-back have anchor papers with empirical evaluations on standard IR benchmarks, while RAG-Fusion entered the field through a Towards Data Science blog post that the framework ecosystem then adopted faster than the academic literature could catch up.¹ The article treats them as a family because the underlying motivation, the shape of the problem they solve, is shared.

. . .

When Each Pattern Wins

The six patterns share a family resemblance, but each has a query shape it was built for. Picking the right one is mostly a matter of recognizing which shape your queries take.

[Figure: Route by query shape.]

Multi-query is the right answer for under-specified queries. Two- or three-word phrases that name a topic without disambiguating among the senses BM25 might match. The LLM's variant generation is essentially asking "what could this user have meant?" and surfacing each plausible interpretation as its own search input. The recall gain compounds across queries because each variant pulls a different slice; the cost is bounded by parallel-call latency, not by sum-of-calls.

HyDE is the right answer for question-to-document vocabulary mismatch. Long natural-language questions whose answers live in statement-form documents the question does not lexically match. The hypothetical document closes the form gap and gives BM25 something stylistically similar to score against. HyDE is also the right answer for cold-start systems with no relevance signal, per its authors' own framing in the paper's discussion section.⁸

Step-back is the right answer for over-narrow technical queries. Queries that pin down a specific time, a specific cell type, a specific reaction condition, a specific error code, and need the broader context the corpus organizes those specifics around. The abstraction step finds the foundational documents; the original query finds the specific data point; the LLM composes both. Step-back is among the cheapest patterns in the family in latency terms, which makes it the safe default when the query distribution is unknown.

Query2doc is the right answer when you want the original query and an LLM expansion at the same time. Where HyDE replaces the query with a hypothetical document, Query2doc concatenates them, which is more conservative: BM25 still scores against the user's literal tokens and gains the expansion on top. The 3-15% headline range on MS-MARCO and TREC DL is the most BM25-specific empirical anchor in the family.¹⁴

Decomposition is the right answer for multi-hop questions. Questions that genuinely require combining facts from separate documents, where no single document contains the answer. The cost is the highest of the family, but the alternative for these queries is not "use a cheaper pattern"; the alternative is "produce a wrong answer," because no single retrieval can ground a multi-hop synthesis. The 21-point retrieval lift IRCoT reports is the empirical evidence that the cost is worth paying when the question shape demands it.¹⁸

Rewrite-Retrieve-Read is the right answer for mature systems with relevance feedback loops. Once the system has enough log data to provide a training signal, the rewriter can be specialized to the domain and optimized end-to-end against the reader's success. This is the post-cold-start move, the pattern that takes over once HyDE has done its job of bootstrapping the system through its early life.

The patterns compose, which is the second thing to recognize. A production system can run step-back as a default pre-retrieval transformation, fall back to decomposition for queries the abstraction call flags as multi-hop, and combine multi-query with RRF on top for the queries that survive both classifications. The combination has more knobs than any single pattern, but the underlying patterns are independent enough that the engineering of the combination is mostly orchestration, not redesign. The Raudaschl repository, LangChain's MultiQueryRetriever, and Elasticsearch's RRF retriever all compose cleanly because each ends in the same kind of step: merging several result lists into one. The Raudaschl repository and Elasticsearch apply RRF at that step, while LangChain's retriever by default takes a deduplicated union that RRF can replace.
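The routing logic described in this section can be sketched as a small dispatch function. This is a toy under stated assumptions: the three-word threshold, the question-mark test, and the `looks_multi_hop` hook are all hypothetical placeholders for whatever classifier a real system would use, and the error code in the example is invented.

```python
import re
from typing import Callable


def route(
    query: str,
    looks_multi_hop: Callable[[str], bool] = lambda q: False,
    short_query_max_words: int = 3,
) -> str:
    """Pick a transformation pattern from the surface shape of the query.

    The threshold and the question-mark test are illustrative assumptions,
    not values taken from any of the cited papers.
    """
    if looks_multi_hop(query):
        return "decomposition"  # needs evidence from several documents
    n_words = len(re.findall(r"\w+", query))
    if n_words <= short_query_max_words:
        return "multi-query"  # under-specified: fan out into variants
    if query.rstrip().endswith("?"):
        return "hyde"  # question form against a statement-form corpus
    return "step-back"  # default when the shape is unclear


examples = [
    "memory issues",
    "What is the degradation rate of lithium iron phosphate batteries "
    "at 45 degrees Celsius after 1000 cycles?",
    "error code E1234 during firmware update",  # hypothetical error code
]
for q in examples:
    print(f"{route(q):12s} <- {q}")
```

In practice the multi-hop check would probably be an LLM call or a trained classifier rather than a lambda, and the routes would be validated against the judged eval set like any other change to the pipeline.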

. . .

Multi-Query Retrieval

Multi-query is the simplest pattern in the family and the one most production frameworks ship out of the box. The user types a query, the LLM generates a small number of alternative phrasings, each phrasing runs through BM25 in parallel, and the result lists are merged into a single ranking. The motivation is recall: different phrasings activate different parts of the inverted index, and the union of their result sets is broader than any single query would have produced.

The seminal practitioner account is Adrian Raudaschl's "Forget RAG, the Future is RAG Fusion," published on Towards Data Science in October 2023.¹ The piece coined the term "RAG-Fusion" and laid out a four-step procedure:

Translate the user query into similar but distinct queries via an LLM.
Perform vector searches for the original and each new query.
Aggregate the result lists using reciprocal rank fusion.
Pass the fused list to the LLM for synthesis.

The procedure as written used vector search, but the same four-step pattern works unchanged with BM25 on the retrieval side; what matters is having multiple ranked lists to fuse.

The companion code in the rag-fusion repository made the pattern concrete with the kind of detail an academic paper rarely supplies.⁴ The repository's default configuration generates four LLM-rewritten variants and runs five total queries (the original plus the four rewrites) in parallel. The RRF k constant is set to 60, matching the SIGIR 2009 default the article gets to later. The variant-generation prompt is short: "You are a search expert. Generate diverse search queries that explore different aspects of the user's question." followed by "Generate 4 diverse search queries for: {original_query}". The simplicity is the point. The LLM does not need elaborate instructions to produce useful variants; it needs to be asked.
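The whole pattern fits in a few lines once the LLM and the retriever are treated as callables. The sketch below is not the repository's code: `generate_variants` and `search` are hypothetical hooks, and the canned index exists only to make the example runnable. The fusion step uses the standard RRF formula, in which each document scores the sum of 1 / (k + rank) over the lists it appears in, with rank starting at 1.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(d) = sum of 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def multi_query_search(
    query: str,
    generate_variants: Callable[[str, int], list[str]],  # hypothetical LLM hook
    search: Callable[[str, int], list[str]],  # hypothetical BM25 hook
    n_variants: int = 4,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Search with the original query plus LLM variants, then fuse with RRF."""
    queries = [query] + generate_variants(query, n_variants)
    return reciprocal_rank_fusion([search(q, top_k) for q in queries])


# Stand-ins so the sketch runs without an LLM or an index.
def fake_variants(query: str, n: int) -> list[str]:
    return [f"{query} (variant {i})" for i in range(1, n + 1)]


CANNED_RESULTS = {
    "memory issues": ["doc_a", "doc_b", "doc_c"],
    "memory issues (variant 1)": ["doc_b", "doc_a", "doc_d"],
    "memory issues (variant 2)": ["doc_b", "doc_c"],
}


def fake_search(query: str, top_k: int) -> list[str]:
    return CANNED_RESULTS.get(query, [])[:top_k]


fused = multi_query_search("memory issues", fake_variants, fake_search, n_variants=2)
print([doc_id for doc_id, _ in fused])
```

Here doc_b wins because it ranks near the top of all three lists, even though it is not first for the original query. RRF uses only ranks, so the lists being fused do not need comparable scores.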

LangChain's MultiQueryRetriever ships a slightly more conservative default. The framework's in-tree prompt asks the LLM to "generate 3 different versions of the given user question to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of distance-based similarity search."⁵ Three variants instead of four, separated by newlines, routed to the retriever in parallel. The framework documentation positions the technique as a fix for the limitations of similarity search specifically, but the recall benefit transfers cleanly to BM25, where the limitations are different but the cure is the same.

The peer-reviewed evaluation of RAG-Fusion came later. Zackary Rackauckas's paper in the International Journal on Natural Language Computing applied the pattern to a real enterprise product-search corpus at Infineon, the German semiconductor company, where engineers, account managers, and customers needed rapid access to technical product documentation.⁶ The qualitative finding was that the technique provided "accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives," but Rackauckas also named the specific failure mode the pattern introduces: "some answers strayed off topic when the generated queries' relevance to the original query is insufficient." The same LLM that gives you broader recall can also drift away from the user's actual intent, and the multi-query pattern has no built-in defense against it.

The community handbook at Full Stack Retrieval frames the motivation in terms a working engineer recognizes: the documents returned represent "a more well rounded context for your LLM to work with," because each variant pulls a different slice of the index.⁷ The expanded result set "might be able to answer questions better than docs from a single query." This is the recall-broadening argument in plain language, and it is the reason multi-query has become the default first move when a team measures their RAG system and finds the retriever surfacing too few of the documents the user actually needed.

The cost of multi-query is straightforward to reason about. Each variant is an independent retrieval call, so the retrieval cost is linear in the number of variants. If you run five queries in parallel against the same BM25 index, you pay roughly five times the retrieval compute, but you only pay it once per user query and you can run the calls concurrently. The LLM cost is one variant-generation call per user query, which is small relative to the synthesis call the LLM does at the end of the pipeline. The latency cost is bounded by the slowest of the parallel calls rather than their sum, which a later section will quantify with practitioner numbers. None of these costs are prohibitive, and the recall gain from running four well-chosen variants typically pays for them on any query distribution where the under-specification problem is real.
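The "slowest call, not the sum" claim is easy to check with a thread pool. In this sketch the retriever is simulated by a function that sleeps for 100 ms, which is an assumption made purely for the demonstration; a real BM25 call would go where `slow_fake_search` is.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def search_all(
    queries: list[str], search: Callable[[str], list[str]]
) -> tuple[list[list[str]], float]:
    """Run one search per query concurrently; return results and wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        results = list(pool.map(search, queries))  # keeps input order
    return results, time.perf_counter() - start


def slow_fake_search(query: str) -> list[str]:
    """Stand-in for a retrieval call that takes about 100 ms."""
    time.sleep(0.1)
    return [f"hit for {query}"]


queries = [f"variant {i}" for i in range(5)]
results, elapsed = search_all(queries, slow_fake_search)
print(f"{len(results)} result lists in {elapsed:.2f}s; sequential would be about 0.50s")
```

Threads are enough here because the work is I/O-bound: the client spends its time waiting on the search backend. The backend still does five searches' worth of work, so this trades compute for latency rather than removing the cost.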

[Figure: Multi-Query Retrieval with RRF.]

. . .

HyDE: Hypothetical Document Embeddings

HyDE is the most counterintuitive pattern in the family. Instead of searching with the user's question, ask an LLM to write a hypothetical answer to the question, and then search with the hypothetical answer. The hypothetical document, even when factually imperfect, is stylistically closer to documents in the corpus than the question is, because the corpus is written in statement-form and the LLM, given the right prompt, will produce statement-form text too. The retrieval task gets reformulated from question-to-document matching into document-to-document matching, which is the regime almost every retriever, sparse or dense, was trained on or tuned for.

The anchor paper is Gao, Ma, Lin, and Callan's "Precise Zero-Shot Dense Retrieval without Relevance Labels," published at ACL 2023 and originally posted to arXiv in December 2022.⁸ The paper frames the mechanism precisely: "Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity." The encoder's "dense bottleneck" is what filters the hallucinations: false details that do not survive the bottleneck cannot anchor the retrieval to incorrect documents.

The reframing the paper makes explicit is the part worth quoting in full, because it captures why HyDE works at all: "Critically, here we offload relevance modeling from representation learning model to an NLG model that generalizes significantly more easily, naturally, and effectively. Generating examples also replaces explicit modeling of relevance scores."⁸ The encoder is no longer asked to model the relationship between a question and an answer; it is asked to model the relationship between two pieces of statement-form text, which is what it was trained to do. The LLM takes over the part the encoder was bad at.

The numbers in the paper are striking: across the two standard TREC DL passage-retrieval benchmarks and a low-resource subset of BEIR, HyDE significantly outperforms the unsupervised Contriever baseline and performs comparably to fine-tuned retrievers, with no fine-tuning and no relevance labels.⁸

The original paper used Contriever on top of the hypothetical document, but the relevance-offload argument applies just as well to BM25 on the corpus side: the hypothetical document is statement-form text BM25 scores against the corpus as if it were any other document. Production frameworks have wrapped HyDE in slightly different ways, each with its own configuration defaults; the framework documentation pages also surface two HyDE failure cases (named-entity ambiguity and prior-induced bias on open-ended queries) that the original paper relegated to a single sentence.⁹ ¹⁰
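A BM25 variant of HyDE can be sketched with a toy index. Everything here is an assumption for illustration: `TinyBM25` is a minimal Okapi-style scorer rather than a production analyzer, the three documents are invented, and `fake_llm` returns a canned passage where a real system would call a model. This is HyDE's idea applied to a sparse retriever, not the paper's Contriever pipeline.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Minimal Okapi BM25 over an in-memory corpus (illustration only)."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.doc_len = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        self.avgdl = sum(self.doc_len.values()) / len(docs)
        df = Counter(term for tf in self.tf.values() for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def search(self, query: str, top_k: int = 3) -> list[str]:
        scores: dict[str, float] = {}
        for doc_id, tf in self.tf.items():
            norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in tokenize(query)
                if t in tf
            )
            if score > 0:
                scores[doc_id] = score
        return sorted(scores, key=lambda d: -scores[d])[:top_k]


def hyde_search(question: str, llm: Callable[[str], str], index: TinyBM25) -> list[str]:
    """Search with an LLM-written hypothetical passage instead of the question."""
    hypothetical = llm(f"Write a short passage that answers: {question}")
    return index.search(hypothetical)


# Invented corpus and canned LLM output, so the sketch runs offline.
DOCS = {
    "lfp-fade": "LFP cells exhibit capacity fade at elevated temperatures during extended cycling.",
    "phone-faq": "What is the best battery for a phone? Questions and answers about degradation.",
    "mining": "Lithium mining output and the rate of supply growth.",
}


def fake_llm(prompt: str) -> str:
    return "LFP cells show gradual capacity fade at elevated temperatures after extended cycling."


index = TinyBM25(DOCS)
question = (
    "What is the degradation rate of lithium iron phosphate batteries "
    "at 45 degrees Celsius after 1000 cycles?"
)
raw_hits = index.search(question)
hyde_hits = hyde_search(question, fake_llm, index)
print("raw question:", raw_hits)
print("HyDE passage:", hyde_hits)
```

On this toy corpus the raw question ranks the relevant document last, because its function words and surface vocabulary match the off-topic documents better. The hypothetical passage ranks it first. Note that with BM25 there is no dense bottleneck to filter out false details, so whatever the LLM invents is scored literally; that is one reason to measure this variant on a judged eval set before trusting it.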

The authors of the original HyDE paper added a candid lifecycle observation in their discussion section that practitioner accounts often miss: HyDE is most valuable when relevance signal is scarce, and it should be retired as relevance signal accrues. "At the very beginning of the life of the search system, serving queries using HyDE offers performance comparable to a fine-tuned model, which no other relevance-free model can offer. As the search log grows, a supervised dense retriever can be gradually rolled out. As the dense retriever grows stronger, more queries will be routed to it, with only less common and emerging ones going to HyDE backend."⁸ The technique is a cold-start tool by its own authors' framing, valuable specifically when the team does not yet have enough click data or labels to train a retriever properly.

. . .

Step-Back Prompting

Step-back prompting approaches the over-specification problem from the opposite direction to HyDE. Where HyDE imagines a hypothetical answer and uses it as the query, step-back imagines a more general version of the question and searches with both. The intuition is that very specific questions retrieve narrow slices of the corpus and miss the foundational context an answer typically needs. Abstracting the question to a higher level surfaces the documents that contextualize the specific answer.

The anchor paper is Zheng et al.'s "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models," from Google DeepMind, published at ICLR 2024.¹¹ The mechanism is stated cleanly: "We present Step-Back Prompting, a simple prompting technique that enables LLMs to do abstractions to derive high-level concepts and first principles from instances containing specific details. Instead of addressing the question directly, we first prompt the LLM to ask a generic step-back question about a higher-level concept or principle." The worked example in the paper is illustrative: the specific question "Which school did Estella Leopold attend between Aug 1954 and Nov 1954?" becomes the step-back question "What was Estella Leopold's education history?" The narrow temporal slice gives way to a question whose answer naturally contains the narrow slice.

The retrieval-augmented version of the pattern, which is the one this article cares about, runs two retrievals instead of one. "Using the step-back question, we do retrieval augmentation. Using both the retrieval augmentations from the original question and the step-back question, we formulate the final prompt."¹¹ The original question's retrieval finds the documents that mention Estella Leopold's specific 1954 enrollment; the step-back question's retrieval finds the documents that lay out her full educational arc. The final LLM prompt has both, and the answer can ground itself in either or both.
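The two-retrieval flow can be sketched as follows. The prompt wording, the `llm` and `search` hooks, and the placeholder passages are all assumptions made for this example; the paper's own prompts use few-shot exemplars that are not reproduced here.

```python
from typing import Callable

# Hypothetical prompt wording, not the paper's exemplar-based prompt.
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the underlying "
    "concept or history.\nQuestion: {question}\nStep-back question:"
)


def step_back_prompt(
    question: str,
    llm: Callable[[str], str],
    search: Callable[[str, int], list[str]],
    top_k: int = 3,
) -> tuple[str, str]:
    """Retrieve for the original and the step-back question, then build the final prompt."""
    step_back_q = llm(STEP_BACK_PROMPT.format(question=question)).strip()
    specific = search(question, top_k)  # narrow evidence
    general = search(step_back_q, top_k)  # contextual evidence
    passages = list(dict.fromkeys(specific + general))  # dedupe, keep order
    context = "\n".join(f"- {p}" for p in passages)
    final_prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return step_back_q, final_prompt


# Stand-ins so the sketch runs offline; passages are placeholders, not facts.
ORIGINAL_Q = "Which school did Estella Leopold attend between Aug 1954 and Nov 1954?"
STEP_BACK_Q = "What was Estella Leopold's education history?"
CANNED = {
    ORIGINAL_Q: ["passage about the 1954 enrollment", "shared biography passage"],
    STEP_BACK_Q: ["shared biography passage", "passage listing degrees and dates"],
}


def fake_llm(prompt: str) -> str:
    return STEP_BACK_Q


def fake_search(query: str, top_k: int) -> list[str]:
    return CANNED.get(query, [])[:top_k]


step_back_q, final_prompt = step_back_prompt(ORIGINAL_Q, fake_llm, fake_search)
print(final_prompt)
```

The two `search` calls are independent once the step-back question exists, so they can run concurrently. Deduplicating before building the prompt keeps a passage that both queries retrieve from using up context twice.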

The empirical results in the paper cover four benchmarks, with PaLM-2L lifted over baseline as follows:¹¹

MMLU Physics: +7%
MMLU Chemistry: +11%
TimeQA (temporal reasoning): +27%
MuSiQue (multi-hop QA): +7%

The TimeQA result deserves special attention because the dataset is exactly the kind of over-specified-query distribution step-back was designed for: each question pins down a specific time window and asks what was true during it, and the right answer typically lives inside a broader contextual document the narrow query would never surface.

The TimeQA numbers with retrieval in the loop are worth quoting: step-back alone reaches 66% accuracy, step-back plus RAG reaches 68.7%, and on the harder instances of the benchmark, step-back hard-accuracy hits 62.3%, outperforming GPT-4's 42.6% on the same slice.¹¹ The combination is the point: the abstracted query brings in the contextual documents, RAG brings in the retrieved evidence, and the final synthesis has access to both. Neither alone matches the joint performance.

The error analysis in the same paper is what tempers the headline numbers. After step-back is applied, "more than half of the errors are due to reasoning errors" while "45% of errors are due to failure in retrieving the right information."¹¹ The decomposition tells you what step-back can and cannot fix. It cannot make a weaker LLM reason better, and it cannot rescue a retrieval pipeline that is still missing the relevant documents even with the abstracted query. The fix-rate the paper reports for step-back plus RAG is similarly honest: the combination fixes 39.9% of the predictions where the baseline was wrong, while introducing 5.6% new errors. The net is positive but bounded.

The conceptual ancestor of step-back is least-to-most prompting, introduced by Zhou et al. at ICLR 2023.¹² Least-to-most articulates the decomposition principle that both step-back and full query decomposition build on: "The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems." The SCAN result from that paper is the dramatic one (99% accuracy with 14 exemplars versus 16% for chain-of-thought), but the conceptual contribution is more durable: complex problems often admit a layered solution where each layer is easier than the whole.

Step-back applies that principle to retrieval. Instead of trying to retrieve the answer to the complex specific question directly, retrieve the answer to the simpler general question and let the LLM compose the specific answer from the general context. The pattern is among the cheapest of the family in latency terms, because the abstraction step is a single LLM call and the two retrievals run in parallel. A practitioner survey that we look at in the cost-ledger section ranks step-back as "low to medium" latency, describing the added cost as "minimal."¹³

. . .

Query2doc and Pseudo-Document Generation

Query2doc is the pattern that pushed the pseudo-document idea explicitly toward sparse retrieval. Where HyDE was framed for dense embedding-based retrieval and used the hypothetical document as the search input, Query2doc generates a pseudo-document and concatenates it with the original query before handing the combined text to BM25. The original query stays in the search input; the pseudo-document acts as a vocabulary expansion that gives BM25 more terms to score against.¹⁴

The headline empirical result is the BM25-specific number that anchors this article's argument. Query2doc "boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning."¹⁴ A 15% improvement on a thirty-year-old sparse retriever, achieved by changing nothing about the index and adding a single LLM call to the query path, is the kind of result that justifies the entire query-transformation research agenda. The range matters too: the gain is 3% on some datasets and 15% on others, and it plausibly tracks how much the original queries suffer from vocabulary mismatch.

Why does the pattern work? The paper's own explanation is that pseudo-documents "often contain highly relevant information that can aid in query disambiguation and guide the retrievers." A short query of "memory issues" gets a pseudo-document about working memory, cognitive load, age-related decline, and medication interactions, all expressed in the vocabulary the corpus actually uses. BM25 now has dozens of terms to score against instead of two, and the IDF-weighted overlap with the right documents jumps accordingly. The vocabulary-mismatch problem the article opened with is exactly the problem this pattern was designed to solve.
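The expansion step itself is string concatenation. In the sketch below the pseudo-document is a hand-written stand-in for LLM output, and repeating the query `n_repeat` times is an assumption of this example: it is one simple way to keep the user's own terms from being drowned out by a much longer pseudo-document when BM25 counts term frequencies.

```python
import re


def query2doc_expand(query: str, pseudo_doc: str, n_repeat: int = 5) -> str:
    """Concatenate the (repeated) original query with the LLM pseudo-document."""
    return " ".join([query] * n_repeat + [pseudo_doc])


def distinct_terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "memory issues"
# Hand-written stand-in for what an LLM might generate.
pseudo_doc = (
    "Memory issues can involve working memory, cognitive load, "
    "age-related decline, and medication interactions."
)

expanded = query2doc_expand(query, pseudo_doc)
print(len(distinct_terms(query)), "distinct terms ->", len(distinct_terms(expanded)))
print(expanded)
```

The expanded string goes to BM25 unchanged. The toy pseudo-document is a single sentence, so the term count grows from 2 to 13; a paragraph-length generation is where the "dozens of terms" come from.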

The prompt template in the paper is austere. A brief instruction:

Write a passage that answers the given query:

Plus a small number of in-context examples drawn from the target dataset. Few-shot, not zero-shot, which matters: the in-context examples teach the LLM the style of the corpus's pseudo-documents. The released generations come from text-davinci-003 and are publicly available on Hugging Face, which gives downstream researchers a reproducible artifact to build on rather than having to regenerate the pseudo-documents themselves.¹⁴
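Assembling a few-shot prompt of this shape is mechanical. The sketch below assumes a "Query:" / "Passage:" layout for the in-context examples; that layout and the placeholder examples are illustrative choices, not the paper's exact template.

```python
def build_query2doc_prompt(query: str, examples: list[tuple[str, str]]) -> str:
    """Build a few-shot prompt: instruction, (query, passage) examples, target query."""
    parts = ["Write a passage that answers the given query:", ""]
    for example_query, example_passage in examples:
        parts += [f"Query: {example_query}", f"Passage: {example_passage}", ""]
    parts += [f"Query: {query}", "Passage:"]
    return "\n".join(parts)


# Placeholder examples; in practice these would be drawn from the target dataset.
few_shot = [
    ("example query one", "An example passage in the style of the corpus."),
    ("example query two", "Another example passage in the same style."),
]

prompt = build_query2doc_prompt("memory issues", few_shot)
print(prompt)
```

Ending the prompt on an open "Passage:" line nudges the model to continue in the same register as the examples, which is the point of going few-shot.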

The paper also reports a finding that has aged into common wisdom: model size matters for this pattern. "[Query2doc] works best when combined with the most capable LLMs while small language models only provide marginal improvements over baselines."¹⁴ A weak LLM cannot produce a pseudo-document that contains the right vocabulary and the right factual associations, and a pseudo-document full of generic filler does not help BM25 find anything it would not have found anyway. The query-transformation patterns share this property: the value of the pattern is bounded above by the quality of the LLM doing the transformation.

A second primary source covers the same territory from Google's side of the field. Jagerman, Zhuang, Qin, Wang, and Bendersky's "Query Expansion by Prompting Large Language Models," from the Gen-IR Workshop at SIGIR 2023, frames the design space as three prompting strategies: zero-shot, few-shot, and chain-of-thought.¹⁵ The paper's specific finding on CoT is the one practitioners returned to most: "CoT prompts are especially useful for query expansion as these prompts instruct the model to break queries down step-by-step and can provide a large number of terms related to the original query." The breakdown-style decomposition produces more expansion terms than a direct generation request would, which directly helps BM25's term-overlap mechanics. The headline result on MS-MARCO and BEIR is that "query expansions generated by LLMs can be more powerful than traditional query expansion methods" like RM3 or BM25-PRF, which had been the previous state of the art in sparse query expansion.

The pre-LLM-era precedent for both Query2doc and HyDE-over-sparse-retrieval is Mao et al.'s Generation-Augmented Retrieval, published at ACL 2021.¹⁶ GAR "augments a query through text generation of heuristically discovered relevant contexts without external resources as supervision," and the headline empirical result is the one the LLM-era papers later replicated at larger scale: "GAR with sparse representations (BM25) achieves comparable or better performance than state-of-the-art dense retrieval methods such as DPR." Two years before HyDE, Mao et al. had already demonstrated that the right generated context, fed to BM25 instead of replacing it with a dense retriever, could match the field's then-best dense baseline.

The GAR paper also previewed the fusion argument that RRF later made canonical: "Generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy." Multiple generated contexts, each retrieved separately, then fused. The pattern is multi-query plus pseudo-document in one technique, with the fusion left abstract. The LLM-era refinement was to make the fusion concrete (Reciprocal Rank Fusion) and the generator powerful (GPT-3 or larger), but the structural insight was already there.

. . .

Query Decomposition

The patterns covered so far rewrite or expand a single user query into one or more retrieval inputs that all target the same information need. Query decomposition does something structurally different: it breaks the user query into a sequence of atomic sub-queries, each of which is retrieved against the index independently, and the answers are composed at the end. Decomposition is the pattern for multi-hop questions, where the answer requires evidence from documents that no single query would surface.

The compositionality gap is the formal motivation. Press et al.'s "Measuring and Narrowing the Compositionality Gap in Language Models," published at Findings of EMNLP 2023, established that "as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does."¹⁷ Bigger models get better at one-step questions faster than they get better at multi-step questions, which means the gap between what the model knows and what it can compose does not shrink even as the model improves. Decomposition is one way to close the gap from the outside: have the model explicitly produce the sub-questions, answer each, then compose.

The pattern Press et al. introduced is called Self-Ask. The mechanism is straightforward: "The model explicitly asks itself (and answers) follow-up questions before answering the initial question." The structured prompting that Self-Ask produces is what makes the search integration tractable: "Self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy."¹⁷ Each follow-up question becomes a search input; the search results inform the LLM's answer to that follow-up; the chain continues until the LLM has enough to answer the original question.

The empirical headline is the Bamboogle result. On Bamboogle, a multi-hop QA benchmark Press et al. introduced specifically to stress-test compositionality, Self-Ask "improves over chain of thought by smaller margins on 2WikiMultiHopQA and Musique datasets but by a large 11% (absolute) on Bamboogle."¹⁷ Eleven absolute points is a substantial gain for a method that does not change the model or the index, only the prompting structure.

The IRCoT paper, from Trivedi, Balasubramanian, Khot, and Sabharwal at ACL 2023, took the same intuition further by interleaving the retrieval with the reasoning rather than treating them as separate phases.¹⁸ The motivation the paper articulates is the natural one: "This one-step retrieve-and-read approach is insufficient for multi-step QA because what to retrieve depends on what has already been derived, which in turn may depend on what was previously retrieved." Each chain-of-thought step emits a new sub-query for retrieval, the retrieved evidence informs the next reasoning step, and the cycle continues until the LLM produces a final answer.
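The interleaving loop is easier to see in code than in prose. What follows is a minimal sketch, not the paper's implementation: `next_step` and `search` are hypothetical callables standing in for the LLM and the BM25 retriever, and the "So the answer is:" stop marker and the step cap are assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple


def interleaved_retrieve(
    question: str,
    next_step: Callable[[str, List[str], List[str]], str],  # hypothetical LLM wrapper
    search: Callable[[str], List[str]],                      # hypothetical BM25 wrapper
    max_steps: int = 4,
    answer_prefix: str = "So the answer is:",
) -> Tuple[str, List[str]]:
    """Alternate one reasoning step with one retrieval until an answer appears."""
    evidence: List[str] = list(search(question))  # first retrieval uses the raw question
    thoughts: List[str] = []
    for _ in range(max_steps):
        thought = next_step(question, thoughts, evidence)  # one reasoning sentence
        thoughts.append(thought)
        if thought.startswith(answer_prefix):
            return thought[len(answer_prefix):].strip(), evidence
        # The newest reasoning sentence becomes the next retrieval query.
        for doc in search(thought):
            if doc not in evidence:
                evidence.append(doc)
    return "", evidence  # step budget exhausted without a final answer
```

The step cap matters in practice: every pass through the loop is a sequential LLM call, so the cap is also the latency ceiling.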

The IRCoT empirical results quantify what interleaving buys over one-step retrieval. "Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC."¹⁸ The 21-point retrieval gain is the headline, and the paper also notes that "IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning." The intuition is that retrieved evidence at each step grounds the next step, which prevents the cascade of reasoning errors that pure CoT can produce.

The framework implementations of decomposition take both shapes. LlamaIndex's SubQuestionQueryEngine takes the single-shot approach: a single user query is broken into multiple sub-questions, each is answered against the same index, and a final LLM call synthesizes the answers.¹⁹ The worked example in the documentation makes the pattern concrete: the query "How was Paul Grahams life different before, during, and after YC?" decomposes into three sub-questions, "What did Paul Graham work on before YC?", "What did Paul Graham work on during YC?", and "What did Paul Graham work on after YC?" Three sub-questions, three retrievals, one synthesis at the end. The same framework's StepDecomposeQueryTransform takes the iterative approach: each step's answer informs the next sub-query, against the same underlying retriever, which is the framework-level analog of IRCoT.²⁰
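The single-shot shape can be sketched without any framework. The code below is an illustration under stated assumptions, not LlamaIndex's implementation: `decompose`, `retrieve`, and `synthesize` are hypothetical callables wrapping the LLM and the BM25 index, and the retrievals run on a thread pool because they do not depend on one another.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def answer_by_sub_questions(
    query: str,
    decompose: Callable[[str], List[str]],                 # hypothetical: one LLM call
    retrieve: Callable[[str], List[str]],                  # hypothetical: one BM25 search
    synthesize: Callable[[str, Dict[str, List[str]]], str],  # hypothetical: final LLM call
) -> str:
    """Decompose once, retrieve every sub-question in parallel, synthesize once."""
    sub_questions = decompose(query)
    with ThreadPoolExecutor(max_workers=max(1, len(sub_questions))) as pool:
        # pool.map preserves input order, so hits line up with sub_questions.
        hits = list(pool.map(retrieve, sub_questions))
    evidence = dict(zip(sub_questions, hits))
    return synthesize(query, evidence)
```

Swapping the thread pool for a plain loop gives the same answer with the retrieval latencies added together, which is the difference the cost discussion turns on.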

The decomposition pattern has the highest cost of the family. The DMFlow architecture comparison ranks it as "high" latency, citing the multiple LLM calls required.¹³ If you decompose into four sub-questions, you pay four retrieval costs and either four sequential LLM calls (for IRCoT-style interleaved retrieval) or one decomposition call plus four parallel retrievals plus a synthesis call (for the LlamaIndex sub-question style). The latency depends on whether you can parallelize, and the parallelizable variant is what makes the pattern practical for sub-second response times. The total token cost grows linearly with the number of sub-questions.
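The two latency shapes reduce to a sum versus a maximum. The sketch below is a back-of-envelope model only; the millisecond figures in the usage example are hypothetical placeholders, not measurements from any system.

```python
from typing import Sequence


def sequential_latency_ms(n_steps: int, llm_ms: float, retrieval_ms: float) -> float:
    """Interleaved style: each step waits for its own LLM call and retrieval."""
    return n_steps * (llm_ms + retrieval_ms)


def parallel_latency_ms(
    retrieval_ms: Sequence[float], decompose_ms: float, synthesis_ms: float
) -> float:
    """Sub-question style: parallel retrievals cost only as much as the slowest one."""
    return decompose_ms + max(retrieval_ms) + synthesis_ms


# Hypothetical timings for four sub-questions.
print(sequential_latency_ms(4, llm_ms=150, retrieval_ms=20))        # 680
print(parallel_latency_ms([20, 25, 18, 30], decompose_ms=150,
                          synthesis_ms=300))                        # 480
```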

The pattern is worth the cost when the queries genuinely require multi-hop reasoning. A query like "What proportion of the cobalt used in U.S.-manufactured EV batteries originates from artisanal mines in the Democratic Republic of Congo?" cannot be answered by any single document; it requires a document about U.S. EV battery supply chains, a document about global cobalt sourcing, and a document about DRC artisanal mining proportions, and the answer is a multiplication of three numbers retrieved separately. Decomposition is the only pattern in the family that handles this query at all. Multi-query produces variants that hit the same single-document gap; HyDE produces a hypothetical answer that is itself a guess at the multiplication; only decomposition explicitly retrieves each input number and lets the LLM compose.

. . .

Rewrite-Retrieve-Read

The patterns above all use zero-shot or few-shot prompting to drive the query transformation: the LLM is the same large frontier model that does the final synthesis, and its rewriting behavior is steered only by the prompt. The Rewrite-Retrieve-Read framework, introduced by Ma, Gong, He, Zhao, and Duan at EMNLP 2023, asks a different question: what if the query rewriter were itself trained, end-to-end, against the retrieval outcome the system actually cares about?²¹

The framing the paper adopts is explicit: "Unlike prior studies focusing on adapting either the retriever or the reader, our approach pays attention to the adaptation of the search query itself." The motivation is the same gap this article opened with: "there is inevitably a gap between the input text and the needed knowledge in retrieval." The novelty is that the gap-closing model is a small, dedicated, trainable language model rather than a prompted frontier LLM.

The architecture is three stages: rewrite, retrieve, read. "A small language model is adopted as a trainable rewriter to cater to the black-box LLM reader. The rewriter is trained using the feedback of the LLM reader by reinforcement learning."²¹ The reward signal comes from the reader: did the rewritten query lead to a retrieval that let the reader produce the right answer? The training loop optimizes the rewriter against that reward. The pipeline runs as the name suggests: "First, an LLM is used to generate the query, then a web search engine to retrieve contexts."
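To make "reward from the reader" concrete, here is one plausible shape for the scalar a training loop could optimize. This is an assumption for illustration, not the paper's reward definition: it combines exact match with token-level F1 between the reader's answer and a gold answer, and `f1_weight` is a hypothetical knob. The policy-gradient update that consumes the reward is omitted.

```python
import re
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, gold: str) -> float:
    return float(_tokens(prediction) == _tokens(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = _tokens(prediction), _tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rewriter_reward(reader_answer: str, gold_answer: str, f1_weight: float = 1.0) -> float:
    """Score one rewritten query by how well the reader answered after retrieval."""
    return exact_match(reader_answer, gold_answer) + f1_weight * token_f1(
        reader_answer, gold_answer
    )
```

The rewriter never sees this function directly; it only sees that some rewrites lead to higher scores than others, which is what lets it specialize to the corpus.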

The empirical eval in the paper covers open-domain QA and multiple-choice QA. The reported result is qualitative rather than dramatic: "Experiments results show consistent performance improvement, indicating that our framework is proven effective and scalable" across both task families.²¹ The contribution is less about a specific accuracy number and more about the architectural pattern: a small fine-tuned model can be specialized for query rewriting and optimized end-to-end against retrieval outcomes, which is something the zero-shot prompted approaches cannot do.

The companion code release, in the RAG-query-rewriting GitHub repository, supplies the reference implementation of the RL fine-tuning loop.²² The repository is what lets a team that wants to try the technique do it without re-deriving the training procedure from the paper.

The architectural significance of Rewrite-Retrieve-Read is what it implies for the long arc of query transformation. If the rewriter is trainable, it can be specialized to a domain. If the reward signal is the reader's success, the rewriter can be optimized for the specific corpus and the specific task the system serves, rather than for the generic intuitions a frontier LLM brings from its pretraining. The cost is one more model to train, host, and version, but the payoff is a rewriter that improves with the system's relevance log in a way zero-shot prompting cannot. For systems that have outgrown the cold-start lifecycle stage HyDE was framed for, Rewrite-Retrieve-Read is the next move.

. . .

Reciprocal Rank Fusion: The Glue Layer

The patterns that produce multiple ranked lists (multi-query, step-back, GAR-style diverse-context generation) all need a way to combine those lists into a single final ranking. Reciprocal Rank Fusion is what the field has settled on. The technique is older than the LLM era by more than a decade, and the elegance of its design is what kept it relevant long enough for the new patterns to adopt it.

The anchor paper is Cormack, Clarke, and Buttcher's "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," published at SIGIR 2009.²³ The headline claim is direct: "Reciprocal Rank Fusion (RRF) ... consistently yields better results than any individual system, and better results than the standard method Condorcet Fuse." The formula is the simplest plausible one that could work, and the fact that it works is part of what makes it interesting.

The formula sorts documents by a sum across input systems: for each document d and each ranked list r, contribute 1 / (k + r(d)) to the document's RRF score, where r(d) is the rank of document d in list r and k is a smoothing constant. Sum across all input lists. The document with the highest sum wins. The constant k is set to 60 in the original paper, and the paper shows the choice performs well across multiple TREC tracks.²³ Every implementation in the modern stack inherits that 60: the Raudaschl rag-fusion repository defaults to k=60,⁴ the Elasticsearch implementation defaults to rank_constant=60,²⁴ and the LangChain and LlamaIndex defaults follow suit.
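The formula fits in a few lines of Python. This is a minimal sketch of the computation as described above; the function name and the document IDs are illustrative, and each input list is assumed to contain a document at most once.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists by summing 1 / (k + rank) for every appearance of a document."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


fused = reciprocal_rank_fusion([
    ["a", "b", "c"],  # results for variant 1
    ["b", "a", "d"],  # results for variant 2
    ["b", "e", "a"],  # results for variant 3
])
print(fused)  # ['b', 'a', 'e', 'c', 'd']
```

Document b wins because it appears near the top of all three lists, while c and d, each found by a single variant at rank 3, trail the field.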

The property that makes RRF work without per-list score calibration is that it ignores the raw scores entirely. Documents at the top of any input list get a high contribution; documents appearing in multiple lists get additive boosts; documents the system never saw get nothing. The raw BM25 score, the vector cosine similarity, the cross-encoder reranker score, none of them enter the fusion. Only the rank does, which means you can fuse a BM25 list with a dense-retrieval list with a learned-sparse list and the fusion does not need to know that their scores live on incommensurate scales. The Elasticsearch documentation makes this property explicit: RRF "requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results."²⁴

For a multi-query architecture over BM25, the fusion is what turns a collection of imperfect parallel retrievals into a single coherent ranking. Each variant's retrieval is best at the documents that match its specific phrasing; RRF rewards a document that shows up in multiple variants' top-K with an additive boost, which is exactly the property you want when the variants are supposed to be different phrasings of the same underlying need. A document that only one variant found gets a smaller contribution, which is also correct: that document is at higher risk of being off-topic for the user's actual intent.

Elasticsearch ships RRF as a first-class retriever primitive, which is one of the strongest signals that the multi-query-then-RRF pattern has graduated from a userspace experiment into a production-default architecture. The Elasticsearch reference documents the API directly: minimum of two child retrievers, configurable rank_constant (defaulting to 60), configurable rank_window_size (the number of documents from each child list to consider).²⁴ Elastic's engineering blog extends the pattern with weighted RRF, where each child retriever gets its own weight in the fusion, which is relevant when the LLM-generated variants are of uneven quality.²⁵
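For a multi-query setup, the request body is one `standard` child retriever per query variant under a single `rrf` retriever. The helper below is a sketch that only builds the JSON body; the field name `body`, the window size of 50, and the example variants are assumptions for illustration, and the body should be checked against the Elasticsearch version in use.

```python
from typing import Any, Dict, List


def build_rrf_search_body(
    query_variants: List[str],
    field: str = "body",          # hypothetical text field name
    rank_constant: int = 60,
    rank_window_size: int = 50,   # hypothetical window; tune per index
) -> Dict[str, Any]:
    """Build a search body that fuses one BM25 match query per variant with RRF."""
    if len(query_variants) < 2:
        raise ValueError("the rrf retriever needs at least two child retrievers")
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": {"match": {field: variant}}}}
                    for variant in query_variants
                ],
                "rank_constant": rank_constant,
                "rank_window_size": rank_window_size,
            }
        }
    }


body = build_rrf_search_body(["stuck pipe handling", "freeing differentially stuck drill string"])
```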

The fusion layer is what makes the multi-query family tractable. Without RRF, combining N ranked lists into one requires either training a learned rank-aggregation model or hand-tuning per-list weights, both of which are operational burdens that scale badly. With RRF, the fusion is a closed-form computation with one tuning knob that most teams leave at its default. The pattern is one of the rare cases where the simplest thing that could work also happens to be the thing that wins on the benchmarks, and the field has rewarded that simplicity with near-universal adoption.

. . .

The Playground

The decision framework earlier in the article ranks the patterns by query shape. The deep-dive sections explain the mechanics. What neither of those gives is a feel for how much actual retrieval changes when a transformation is applied. The playground below closes that gap on a 40-document oil-and-gas corpus, with six preset queries chosen because each one fails in a different way against a literal BM25 search.

Transformations applied to a 40-document corpus.

BM25 is real, running live in the browser against the corpus on every query change. The HyDE, decomposition, and step-back transformations on the preset queries are hand-curated to read like a competent LLM's output, frozen as JSON. Custom queries run literal BM25 only, with the transformations panel switching to "no transformations available for ad-hoc queries" so the demo does not lie about what is computed. Each preset query carries a relevance judgment, which lets the demo report Precision@5 per transformation. The number is the most honest summary of what each pattern bought, on this corpus, for this query.
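Precision@5 is simple enough to state as code. The sketch below assumes a ranked list of document IDs and a set of judged-relevant IDs; the IDs in the example are made up and are not taken from the playground corpus.

```python
from typing import Iterable, Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k results that carry a positive relevance judgment."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked_ids)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes a retriever that returns fewer than k results.
    return hits / k


print(precision_at_k(["d3", "d7", "d1", "d9", "d4"], {"d1", "d3", "d8"}))  # 0.4
```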

A few queries are worth running. Stuck pipe handling is short and under-specified, where multi-query gains the most. Wolfcamp mud weight at 11,500 feet is a long natural-language question that HyDE was built for. Block 329 production underperformance is a multi-hop question whose answer requires combining the field-average benchmark, the operator's 2023 program, and the completion-failure diagnostics, which is decomposition's case. The remaining presets cover step-back, lost circulation, and decline-rate ranges, each chosen to put a different transformation in its best light.

. . .

The Cost Ledger

Every pattern in this article trades something for the recall and precision gains it produces. The trade is usually latency: at least one extra LLM call per user query, sometimes several, and in some cases multiple retrievals on top of that. A serious treatment of the family has to put numbers on those trades, because the decision of which pattern to deploy depends on the latency budget the system has to work with.

The DMFlow architecture comparison is the most useful side-by-side practitioner survey of the patterns, ranking each on a qualitative latency scale.¹³ Step-back sits at "low to medium," described as "minimal latency," because the abstraction step is a single short LLM call and the two retrievals run in parallel. HyDE sits at "medium," reflecting the cost of one additional LLM call to generate the hypothetical document. Multi-query is also "medium" overall, because the N parallel retrievals are bounded by the slowest call rather than their sum. RAG-Fusion (multi-query plus full RRF plus optional reranking) hits "high" latency due to the multiple retrievals and the post-retrieval reranking pass, with the survey calling out the "high computational cost and latency" of the full architecture. Query decomposition is also "high," because the multiple sub-question LLM calls are often sequential rather than parallel, particularly in the IRCoT-style interleaved variant.

Cost shape per pattern.

The LangChain advanced retrieval tutorial supplies the quantitative number practitioners actually plan against. "Multiple queries increase search time by 2-3x, but optimization can use async/parallel search when possible."²⁶ The same tutorial gives the bound that matters most: "Total added latency is bounded by the slowest contextualizer, not the sum of all of them. In practice, this means 100 to 300ms of added pre-retrieval time." A hundred milliseconds is invisible to a user; three hundred is noticeable but acceptable for a chat-style RAG response. The pattern fits comfortably inside a sub-second latency budget when the LLM calls are issued concurrently and a well-provisioned LLM endpoint serves them.
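The "slowest call, not the sum" bound is what `asyncio.gather` delivers when the calls are independent. The sketch below simulates the LLM calls with `asyncio.sleep`; the pattern names and delays are hypothetical stand-ins for real network and generation time.

```python
import asyncio
import time
from typing import Dict, List, Tuple


async def fake_llm_call(name: str, seconds: float) -> str:
    """Stand-in for one query-transformation call to an LLM endpoint."""
    await asyncio.sleep(seconds)
    return name


async def run_concurrently(delays: Dict[str, float]) -> Tuple[List[str], float]:
    """Issue every call at once and report the wall-clock time for the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_llm_call(name, seconds) for name, seconds in delays.items())
    )
    return list(results), time.perf_counter() - start
```

Calling `asyncio.run(run_concurrently({"multi_query": 0.05, "step_back": 0.10, "hyde": 0.15}))` should take roughly as long as the slowest simulated call rather than the combined 0.30 seconds, with results returned in the order the calls were issued.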

The token-cost side of the ledger is straightforward to estimate. Each LLM call in the rewriting path consumes some number of input and output tokens at the model's per-token price. Per pattern:

Multi-query: one prompt of 100 to 200 tokens, with N short outputs in a single call.
HyDE: one passage of 200 to 400 tokens.
Step-back: one abstraction question of 15 to 30 tokens.
Query2doc: one pseudo-document of 100 to 300 tokens.
Decomposition: N sub-questions at 15 to 30 tokens each, plus one synthesis call at the end.

None of these is more than a fraction of the cost of the final synthesis call the LLM does over the retrieved context, which is typically the dominant token expense in a RAG pipeline.
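The estimate can be mechanized from the ranges listed above. The sketch below covers only the patterns whose generated-output size the list states (multi-query is left out because the per-variant length is not given), excludes the decomposition synthesis call, and takes the per-token price as an input because any specific price would be an assumption.

```python
from typing import Dict, Tuple

# Generated-output token ranges taken from the list above.
GENERATED_TOKENS: Dict[str, Tuple[int, int]] = {
    "hyde": (200, 400),
    "step_back": (15, 30),
    "query2doc": (100, 300),
    "decomposition": (15, 30),  # per sub-question; final synthesis call excluded
}


def generated_token_range(pattern: str, n_sub_questions: int = 1) -> Tuple[int, int]:
    """Low and high estimates of tokens the rewriting step generates per user query."""
    low, high = GENERATED_TOKENS[pattern]
    factor = n_sub_questions if pattern == "decomposition" else 1
    return low * factor, high * factor


def cost_range_usd(
    token_range: Tuple[int, int], usd_per_million_output_tokens: float
) -> Tuple[float, float]:
    """Convert a token range to dollars at a caller-supplied (hypothetical) price."""
    low, high = token_range
    return (
        low * usd_per_million_output_tokens / 1_000_000,
        high * usd_per_million_output_tokens / 1_000_000,
    )


print(generated_token_range("decomposition", n_sub_questions=4))  # (60, 120)
```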

The recall gain side of the ledger is harder to estimate in advance because it depends on the query distribution and the corpus. The empirical anchors the article has already documented are reference points rather than guarantees: HyDE's double-digit nDCG@10 lifts on TREC DL19 and DL20, Query2doc's 3-15% BM25 improvement range on MS-MARCO and TREC DL, step-back's 27% gain on TimeQA, IRCoT's 21-point retrieval improvement on multi-hop QA. The actual gain on a given production system may land inside or outside those ranges, and the only honest way to size it is to run the evaluation loop the measuring-retrieval article walked.

The latency-versus-recall trade is what decides whether a pattern earns its place. A system that needs sub-200ms end-to-end response time cannot afford full query decomposition, even when the queries are genuinely multi-hop, because the sequential LLM calls do not fit the budget. A system whose recall is already 95% on its core query distribution does not need multi-query, because the marginal gain over so strong a baseline is small. A system at the cold start of its lifecycle, with no relevance signal and no fine-tuned retriever, gets the largest absolute gain from HyDE specifically, which is what its authors framed it for. The decision is not "which pattern is best." The decision is which pattern's cost shape matches the system's constraints.

. . .

The Honest Limitations

Every pattern in this article has a failure mode that production users hit, and a serious treatment has to name them. The literature documents most of these directly; the rest come from the framework docs that the authors of the techniques themselves write.

Query drift. An LLM that generates variants, hypothetical documents, or sub-questions can drift away from the user's actual intent, and when it drifts the downstream retrieval drifts with it. Rackauckas names this failure mode directly in the RAG-Fusion evaluation: "some answers strayed off topic when the generated queries' relevance to the original query is insufficient."⁶ The LLM that gives you broader recall in the average case can also widen the retrieval window to off-topic documents in the cases where it misreads the query. The standard mitigation is a cross-encoder reranker after RRF, which the rag-fusion repository's hybrid_diverse+rerank configuration includes by default, and which the measuring-retrieval article walks in its reranking section.⁴

Ambiguous queries break HyDE specifically. The LlamaIndex HyDE documentation calls out the failure mode with a worked example: "HyDE mis-interprets Bel without document context...resulting in a completely unrelated embedding string and poor retrieval outcome."¹⁰ A two- or three-word query containing a named entity the LLM has no context for produces a hypothetical document about the wrong entity, and the resulting retrieval anchors the search around the wrong topic entirely. The mitigation is the include_original=True flag the LlamaIndex implementation exposes, which averages the hypothetical document embedding with the original query embedding and hedges against the worst drift, but the safer move on a query distribution dominated by short ambiguous queries is to skip HyDE entirely.
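The hedging effect of folding the original query back in is plain vector averaging. The sketch below is an illustration of the averaging described above, not LlamaIndex's code; the two-dimensional vectors are toy values and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hyde_query_vector(
    hypothetical_vecs: Sequence[Sequence[float]],
    original_vec: Optional[Sequence[float]] = None,
) -> List[float]:
    """Average the hypothetical-document embeddings, optionally with the original query's."""
    vecs = [list(v) for v in hypothetical_vecs]
    if original_vec is not None:
        vecs.append(list(original_vec))  # pulls the result back toward the user's wording
    return mean_vector(vecs)


print(hyde_query_vector([[1.0, 0.0]], original_vec=[0.0, 1.0]))  # [0.5, 0.5]
```

With one hypothetical document the original query carries half the weight; with five it carries a sixth, so the hedge weakens as more generations are averaged in.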

Open-ended queries skew toward the LLM's prior. The same LlamaIndex documentation names the second HyDE failure mode: "HyDE may bias open-ended queries," producing skewed outputs that favor certain interpretations over balanced responses.¹⁰ The hypothetical document is, in a real sense, a sample from the LLM's prior over what an answer to the question might look like, and that prior is shaped by the LLM's training distribution. A medical query run against a corpus that contains the latest research will retrieve documents biased toward whatever the LLM saw most often in its training data, which may or may not match the corpus's actual emphasis.

Latency at scale. The 100-to-300-millisecond pre-retrieval overhead the LangChain tutorial cites is acceptable for a chat-style RAG response measured in seconds; it is not acceptable for a low-latency search experience measured in tens of milliseconds.²⁶ A patent-search system that needs to return ranked results in 50ms cannot afford even one LLM call in the query path, much less the fan-out of retrievals and the reranking pass that full RAG-Fusion adds on top. The transformation patterns are right for RAG; they are wrong for the autocomplete-and-rank workloads classical IR was built for.

Model size dependence. Wang et al. observed directly that Query2doc "works best when combined with the most capable LLMs while small language models only provide marginal improvements over baselines."¹⁴ The same constraint applies to HyDE, multi-query variant generation, and step-back abstraction. A small or distilled LLM driving the query transformation produces transformations that are either generic, off-topic, or both, and the value of the pattern collapses. The patterns are not free in the sense of being usable with any LLM; they need a strong-enough LLM to do the linguistic work the transformation requires.

Retrieval-versus-reasoning split. Even with the best query transformation, retrieval still accounts for a substantial fraction of the remaining errors. The DeepMind Step-Back paper's own error analysis is the clearest evidence: after step-back is applied, "45% of errors are due to failure in retrieving the right information," and the rest are reasoning errors the technique cannot fix.¹¹ The patterns in this article are necessary but not sufficient: they close part of the gap, and downstream reranking, decomposition, and reader-side improvements are still load-bearing for the part they leave open.

Gameability of pseudo-document generation. A pseudo-document generated by an LLM is a sample from the LLM's distribution, which means it inherits whatever biases that distribution has. If the LLM tends to over-represent certain phrasings, certain entities, or certain interpretations, the pseudo-document will push retrieval toward documents that match those patterns. This is the open-ended-bias case generalized: it is not only open-ended queries that get biased, but any query where the LLM's prior is not uniform over the corpus's content. The mitigation is averaging across multiple generations (the N=5 default in Haystack's HyDE implementation is exactly this), but averaging only smooths the bias; it does not remove it.⁹

Lifecycle: HyDE retires. The original HyDE authors put a lifecycle note in their discussion section that the practitioner literature often skips: HyDE is most valuable at the beginning of a search system's life, and it should be gradually retired as the system accrues click-logs and relevance signal to train a supervised retriever.⁸ The technique is a cold-start tool, not a forever tool. A team that has been running HyDE for two years on a high-traffic system, and has accumulated millions of judged or quasi-judged query-document pairs in the process, should be asking whether some of that signal would be better spent on training a Rewrite-Retrieve-Read-style specialized model instead.

None of these limitations is a reason not to deploy the patterns. They are a reason to deploy them with the same measurement discipline the measuring-retrieval article called for: judged eval set, stratified metrics, before-and-after on the queries the technique was supposed to help, plus the queries it might hurt. The closed-loop discipline that BM25 evaluation enables is exactly what catches a query-transformation pattern that helps the average case while degrading the long-tail or the safety-critical slice. Without the loop, the patterns are religion. With the loop, they are engineering.

. . .

Where This Fits in the Week 5 Stack

The Week 5 articles cover the lexical retrieval layer of a RAG pipeline at three levels of resolution. The classic-search article walks the retriever itself: how an analyzer turns text into tokens, how an inverted index stores those tokens, how BM25 scores documents against a query, what the user sees when they type a query and hit enter. That article is the foundation. Everything in the present article assumes you have already understood the BM25 pipeline as a closed-form, hand-computable, parameter-tunable mechanism.

The measuring-retrieval article walks the evaluation discipline that tells you whether the retrieval pipeline is working. The five-step BM25 evaluation loop (build a relevance-judged eval set, choose metrics, run BM25, sweep k1 and b, stratify the metrics) is the measurement infrastructure every claim in this article ultimately rests on. The "When Measurement Reveals a Problem" section of that article named HyDE, multi-query, step-back, and query expansion as the moves available when stratified evaluation revealed a query-side failure, but stopped short of explaining each pattern in depth. The article you are reading now is the deep treatment of that section.

The pairing matters because the choice of pattern from this article depends on what the measurement loop from the previous article surfaces. If stratified evaluation shows BM25 doing well on identifier queries and badly on long natural-language queries, HyDE or Query2doc are the right interventions to try first. If the failure mode is on multi-hop questions specifically, decomposition is the right pattern. If recall is uniformly mediocre across query types, multi-query with RRF is the broadest fix. Without the stratified picture from the measurement loop, you are guessing at which pattern to deploy, which produces a deployment that may or may not improve anything.

The embedding-models-for-rag companion article covers the dense-vector side of the question, which is the other answer to the same problem this article addresses. The two answers are not in competition. A mature production system in 2026 often runs both: BM25 with query transformation on the lexical side, dense embeddings with hybrid fusion on the semantic side, and a learned reranker over the union. The point of this article is that the lexical side, augmented with LLM-driven query transformation, is more capable than the "BM25 is for keywords, embeddings are for meaning" narrative suggests. Most of the meaning the user is asking about can be surfaced through better queries, without changing the retriever.

The simple-rag-walkthrough demo and the simplest-possible-rag companion piece show what the bare minimum looks like: BM25 plus LLM synthesis, no transformation, no reranking. That is the baseline against which the patterns in this article are improvements. The improvements are not free, but they are cheap enough relative to the synthesis call at the end of the pipeline that most production RAG systems should be running at least one of them. The pattern most teams should reach for first is step-back, because it is the cheapest in latency, the simplest in implementation, and the most robust to the failure modes that hit HyDE and multi-query harder.

The thesis the article opened with is the one the empirical evidence supports. Most user queries are wrong for the index they are searching, and the LLM is the most cost-effective fix for that wrongness. The corpus side stays pure-lexical, the inverted index stays untouched, BM25 stays the retriever the engineering team understands and controls. The LLM intervenes at exactly the moment the user's text needs to be reshaped, and the reshaping is where the recall and precision gains come from. Six patterns, one fusion layer, one cost ledger, and the closed evaluation loop the previous article walked. That is the query-transformation stack.

. . .

References

Raudaschl, A. H. (2023). "Forget RAG, the Future is RAG Fusion." Towards Data Science. The practitioner write-up that coined "RAG-Fusion": LLM generates multiple query variants, each is retrieved in parallel, results are fused with Reciprocal Rank Fusion.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2023). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv. The canonical taxonomy that places query-side transformation in the pre-retrieval phase of the RAG stack.
Zhu, Y., Zhang, P., Zhang, C., Chen, Y., Xie, B., Liu, Z., Wen, J.-R., & Dou, Z. (2024). "INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning." ACL 2024 Main. The formal three-way decomposition of LLM-for-IR into query understanding, document understanding, and query-document relationship understanding.
Raudaschl, A. H. (2023). "rag-fusion GitHub repository." The reference implementation of RAG-Fusion: 4 LLM-generated variants plus the original, fused by RRF with k=60.
LangChain. (2024). "MultiQueryRetriever." LangChain documentation. The in-tree framework implementation pattern with a default prompt that asks for 3 variants.
Rackauckas, Z. (2024). "RAG-Fusion: A New Take on Retrieval-Augmented Generation." International Journal on Natural Language Computing, Vol. 13 No. 1. The first peer-reviewed evaluation of RAG-Fusion on a real enterprise product-search corpus at Infineon.
Full Stack Retrieval. "Multi-Query." Full Stack Retrieval community handbook. Practitioner-level documentation of the recall-broadening motivation for multi-query.
Gao, L., Ma, X., Lin, J., & Callan, J. (2023). "Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)." ACL 2023 Long Papers. The anchor paper for HyDE: an LLM generates a hypothetical answer document whose style is closer to indexed documents than the question is, then the retrieval is done with that pseudo-document. Reports BM25 vs HyDE on TREC DL19/DL20.
Haystack / deepset. (2024). "Hypothetical Document Embeddings (HyDE)." Haystack documentation. Reference walkthrough confirming the paper's N=5 sampled hypothetical documents, averaged, with concrete generation kwargs.
LlamaIndex. (2024). "HyDE Query Transform Demo." LlamaIndex documentation. Implementation with the include_original flag plus documented failure modes for ambiguous and open-ended queries.
Zheng, H. S., Mishra, S., Chen, X., Cheng, H.-T., Chi, E. H., Le, Q. V., & Zhou, D. (2024). "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." Google DeepMind. ICLR 2024. Introduces step-back prompting: abstract the query to a higher-level concept, search with both. Reports MMLU, TimeQA, MuSiQue numbers.
Zhou, D., Schaerli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., & Chi, E. (2023). "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." Google. ICLR 2023. The conceptual ancestor of step-back and query decomposition.
DMFlow. (2024). "Stop Your RAG System from 'Missing the Point': A Deep Dive into Six Advanced Query Transformation Architectures." DMFlow engineering blog. Practitioner side-by-side comparison of HyDE, Step-Back, Multi-Query, RAG-Fusion, and Query Decomposition along the latency axis.
Wang, L., Yang, N., & Wei, F. (2023). "Query2doc: Query Expansion with Large Language Models." Microsoft Research. EMNLP 2023. The anchor paper for the technique of generating an LLM pseudo-document and concatenating it with the original query before BM25 retrieves. Reports 3-15% BM25 improvement on MS-MARCO and TREC DL.
Jagerman, R., Zhuang, H., Qin, Z., Wang, X., & Bendersky, M. (2023). "Query Expansion by Prompting Large Language Models." Google Research. Gen-IR Workshop @ SIGIR 2023. A second primary source on LLM-driven query expansion for BM25-style retrieval, comparing zero-shot, few-shot, and CoT prompting strategies.
Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., & Chen, W. (2021). "Generation-Augmented Retrieval for Open-Domain Question Answering." Microsoft / UIUC. ACL 2021. The pre-LLM-era precedent for query2doc and HyDE-over-sparse-retrieval: augmenting a query with a generated context lets BM25 match or beat dense retrieval on open-domain QA.
Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., & Lewis, M. (2023). "Measuring and Narrowing the Compositionality Gap in Language Models (Self-Ask)." Findings of EMNLP 2023. Introduces the compositionality gap and Self-Ask, the explicit-follow-up-question pattern for multi-hop QA. Reports an 11% absolute improvement on Bamboogle.
Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2023). "Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT)." ACL 2023. Documents the interleaved retrieval pattern: each CoT step emits a new sub-query, and the retrieved evidence guides the next step. Reports gains of up to 21 points in retrieval and up to 15 points in downstream QA across HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC.
LlamaIndex. (2024). "Sub Question Query Engine." LlamaIndex documentation. Reference implementation of the LLM-decomposition pattern: a single user query is split into sub-questions, each retrieved against the index, then synthesized.
LlamaIndex. (2024). "Query Transformations (Multi-Step Decomposition)." LlamaIndex documentation. The StepDecomposeQueryTransform, framework-level analog of IRCoT.
Ma, X., Gong, Y., He, P., Zhao, H., & Duan, N. (2023). "Query Rewriting for Retrieval-Augmented Large Language Models (Rewrite-Retrieve-Read)." Microsoft Research Asia / SJTU. EMNLP 2023. The anchor paper for the three-stage framework where the query-rewriting model is itself trainable, optimized end-to-end with RL against the reader's reward signal.
Ma, X., et al. (2023). "RAG-query-rewriting GitHub repository." Reference implementation of the trainable-rewriter approach, with code for the RL fine-tuning loop.
Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009. The anchor paper for RRF. Establishes the formula, the k=60 default, and the empirical result that RRF beats both Condorcet Fuse and individual rank-learning methods on TREC data.
Elastic. (2024). "Reciprocal rank fusion." Elasticsearch official documentation. RRF as a first-class retriever primitive with rank_constant defaulting to 60, directly inheriting the Cormack et al. 2009 convention.
Elastic. (2024). "Weighted Reciprocal Rank Fusion (RRF) in Elasticsearch." Elastic Search Labs engineering blog. The extension to per-retriever weights for cases where not all retrievers are equally trustworthy.
LangChain. (2024). "Advanced Retrieval: Latency Commentary." LangChain tutorials. Quantitative practitioner estimate: 100-300ms of added pre-retrieval latency for multi-query, bounded by the slowest of the parallel calls rather than by their sum.
