← All Articles

The Amortization Assumption

RAG is the most-taught pattern in production LLM systems. It is also the wrong default for the workload most people actually have: a handful of PDFs, ten minutes of questions, never opened again.

Week 5 of this course teaches Retrieval-Augmented Generation as an escalation ladder. The lexical floor (Classic Search) comes first, then the cheapest production RAG (The Simplest Possible RAG: Elasticsearch BM25 plus a single LLM call), then the dense-vector stack when measurement proves the cheaper path is insufficient. The ladder is correct for any workload that has a query stream long enough to justify building a retrieval system at all.

That qualifier is doing a lot of work. The ladder presumes there is a corpus worth indexing, served to users whose questions return often enough to amortize the engineering investment. That workload is real. It is also not the workload most readers of this article actually face when they reach for an LLM and a stack of documents.

The dominant interactive document workflow on the consumer surface in 2026 is something else entirely. A user receives a handful of PDFs in their inbox. They drop them into Claude Projects, NotebookLM, or ChatGPT Projects. They ask five or ten questions over the next ten minutes. Then they close the tab and never touch those documents again.

This is the regime where the cost story for RAG quietly inverts, even at the bottom rung of the ladder. It is also the regime where a peer-reviewed academic paper has now given the alternative pattern a name. This article is about that pattern, the assumption it exposes in Week 5's framing, and the honest decision framework that sits underneath "which rung of the ladder is right for this workload?"

. . .

The Ladder Week 5 Teaches

Week 5 walks an escalation ladder rather than a single pipeline. The bottom rung is BM25 keyword retrieval served by Elasticsearch, indexed once and queried at sub-100 ms against hundreds of millions of postings. One rung up is the same retriever plus a single LLM call: forty lines of Python, no embeddings, no second index. The next rung up is dense-vector retrieval, with an embedding model, an ANN index, and a chunking strategy. The top rung is hybrid retrieval with reciprocal rank fusion plus a reranker. Each step adds quality and complexity, and Anthropic's own Contextual Retrieval data (35% / 49% / 67% failure-rate reductions as embeddings and a reranker are added on top of BM25) gives the empirical schedule for when to climb each step.

The economics, less obviously, depend on a hidden assumption that the ladder shares at every rung. Every cost argument for RAG presumes a query stream. You pay an indexing cost once (cheaper at the BM25 rung, more expensive at the vector rung). You pay a per-query retrieval cost on each lookup. The total cost is dominated by the per-query terms only when the query count is large enough to swamp the fixed setup costs.

This is the amortization assumption. RAG looks cheap when fixed costs (index, embed if you bothered, maintain) are amortized across many queries. The cost-per-query goes down as query volume goes up. At ten thousand queries against a stable corpus, RAG is dramatically cheaper than stuffing the corpus into the context window on each call. At ten thousand queries per day, the comparison is not even close.

The published cost ratios for high-volume workloads are dramatic. A well-designed RAG pipeline retrieves five to twenty chunks totaling two to ten thousand tokens. A long-context call sends the entire corpus, every time, and pays for it every time. At enterprise query volumes, the ratio is over a thousand to one in RAG's favor.¹

None of this is wrong. It is a faithful description of the workload that justified building RAG in the first place: search over a slow-changing knowledge base accessed by many users many times. That workload exists, and Week 5's curriculum prepares students to build for it at whichever rung of the ladder their measurements support.

The question this article asks is what happens when the workload does not look like that at any rung.

. . .

The Use Case That Breaks the Assumption

Consider the workflow that has become routine for knowledge workers in the past eighteen months. A lawyer receives three contract drafts the night before a meeting. A researcher gets four PDFs from a colleague who said "tell me what's in these." A product manager downloads a 50-page competitor whitepaper and wants to skim the architectural arguments before standup.

Every one of these sessions has the same shape. A small bounded set of documents, on the order of three to ten PDFs. A short interactive period, on the order of five to fifteen minutes. A handful of questions, on the order of three to fifteen prompts. Then the session ends, the documents are rarely revisited, and there is no second user.

Call this the ephemeral document search regime. It is not a corner case. It is the dominant interactive-AI document workflow on the consumer surface today. Claude Projects, NotebookLM, ChatGPT Projects, and Gemini context cache all market the same workflow as their primary use case.²³

What happens to the amortization assumption in this regime? The fixed costs of RAG (chunking, embedding, indexing, storage) get paid once and divided across five to fifteen queries. The denominator that made RAG look cheap at enterprise scale has shrunk by three orders of magnitude. The per-query cost advantage of RAG narrows to a thin margin or vanishes entirely once you account for setup overhead.

Worse, the engineering complexity of a production-grade RAG store does not shrink with the query count. You still need an embedding model, a chunking strategy, a vector store, retrieval evaluation, and an integration layer. And once the index exists, it is not done. As the corpus grows, the empirical distribution of vectors changes shape. Under a fixed embedding pipeline, the vectors themselves do not move (the embedding model is a deterministic function and old documents stay at the coordinates the model originally assigned them); what changes is the population around them. Regions that were sparse fill out, new topics emerge in previously thin space, and cluster centroids shift toward newer content. The retrieval calibration you had tuned for the original distribution no longer fits the new one as cleanly. Similarity thresholds need recalibrating. Index structures (HNSW graphs, IVF partitions) need rebalancing as the population grows. And the pipeline does not stay fixed. When the embedding model itself ships an updated version, when preprocessing changes, when the model is fine-tuned on domain data, or when the vector store retrains its quantization codebook, every stored vector is now in a different geometry from any newly-embedded text. That is "embedding drift" in the technical sense, and the fix is to re-embed the entire corpus through the new pipeline. None of this is degradation against some "correct" baseline. It is calibration work, ongoing for as long as the store exists.

What changes as a RAG corpus grows is the distribution around the vectors, not the vectors themselves. The same embedding model produces two snapshots of the same store: existing vectors stay where they were, new vectors fill out the population, and neither snapshot is more "correct" than the other.

The original five vectors in each topic stay where the model first put them. New documents add to the population, and a fourth topic emerges in space the original corpus did not occupy. Adding documents does not corrupt the embedding space, because the vectors do not move and the population grows around them. The ongoing engineering work is recalibration to the new distribution, not cleanup of an old one.

For a ten-minute Q&A session, none of this engineering investment ever gets recovered. The recalibration cycles, the index rebuilds, the model-version migrations, the eval pipelines: all of it is paying off a future query stream that will not arrive.

The question is no longer whether RAG is correct in general. It is whether RAG is correct here, for documents that arrive in the user's lap and depart from their attention before any vector index could pay for itself.

. . .

The Cost Math Has Quietly Inverted

The single change that undermines "RAG by default" for ephemeral workloads is prompt caching. Anthropic's pricing structure is precisely tuned to this case, even though the documentation never frames it that way.

The pricing has three components. A cache write costs 1.25x the base input token price, paid the first time the documents enter the context. A cache read costs 0.1x the base input token price, paid every subsequent time those same documents are referenced inside the cache window. The 5-minute TTL is the default; a 1-hour TTL is also available at 2x write.⁴

The break-even math is striking. With the 1.25x write multiplier and the 0.1x read multiplier, a single cache read pays back the write premium. A second read produces a meaningful saving. By the third read, the cached approach is dramatically cheaper than re-sending the documents from scratch.

An ephemeral session that fires five queries over ten minutes hits the cache four times after the initial write, well past the break-even point. A session that fires fifteen queries hits it fourteen times. The pattern Anthropic describes as a general optimization is, in practice, the productized form of "load the documents once, ask many questions, throw it all away."

The latency story tells the same shape. Anthropic reports up to 85% latency reductions for cached prompts, with a 100,000-token book example dropping from 11.5 seconds per response to 2.4 seconds. The difference between conversational and broken, on a workload where the user is sitting at the keyboard waiting.

The crossover analysis published by MindStudio for a 200,000-token corpus is illustrative. A standard RAG query that retrieves five chunks costs roughly $0.012. A long-context query without caching costs roughly $0.60. A long-context query with caching costs roughly $0.06. Without caching, long-context is fifty times more expensive per query than RAG. With caching, it is five times more expensive.⁵

Caching is what makes the comparison interesting. Without it, RAG wins by fifty to one; with it, the gap collapses to a few cents per session, well within the noise of managed-RAG fixed costs.

Five times more is a real difference. It is also fifty times less than the headline "thousand to one" ratio that anchors the standard RAG argument. For a ten-minute, five-query session, the totals work out to roughly $0.06 with RAG, $0.30 with cached long-context, and $3.00 without caching, so the absolute difference between cached long-context and RAG is about 24 cents, less than the setup cost of a managed RAG. The cost gap that makes RAG a categorical winner at enterprise scale becomes a rounding error at the human scale of one user with a few PDFs.

Meanwhile, the managed-cloud RAG side of that comparison still has fixed costs the analysis does not show: the embedding API call to chunk the corpus, the vector store provisioning, and the setup time. None of those amortize over five queries. The 24-cent advantage evaporates the moment you account for the work needed to capture it.

. . .

The Heuristic for Week 5

Will this corpus be queried by many users many times, or by one user over a few minutes?

If many users, many times, the amortization assumption holds. The embedding cost, the index cost, and the drift-management cost are paid against a query stream long enough to repay them. Build the RAG pipeline. The standard Week 5 advice applies.

If one user, one short session, the amortization assumption fails. The fixed setup costs never get amortized, and the engineering investment never returns. CAG with prompt caching, or a Self-Route hybrid, is the right reach. This article's alternate view is exactly that case.

Cache-Augmented Generation

The pattern this article has been describing has a name in the academic literature now. Chan, Liu, and colleagues introduced it as Cache-Augmented Generation, or CAG, in a paper presented as a short paper at the ACM Web Conference 2025. The title is provocative: "Don't Do RAG: When Cache-Augmented Generation Is All You Need for Knowledge Tasks."⁶

The CAG mechanism is direct. Preload all relevant documents into the model's extended context window. Cache the runtime parameters of that loaded state. Run queries against the cached state without any retrieval step. The paper's claim is that for "applications with a constrained knowledge base where documents are of a limited and manageable size," CAG eliminates retrieval latency and minimizes retrieval errors while preserving context relevance.

The phrase "limited and manageable size" is doing a lot of work, and the modern frontier-model context window is what makes that phrase generous. A 1M-token window is not a marketing artifact; it holds about 1,500 PDF pages or roughly six average books. The handful-of-PDFs case occupies a small fraction of that capacity. Common document loads plotted against the modern Claude or Gemini frontier-model window, each bar starting at the same origin, look like this:

A "handful of PDFs" occupies under a third of the window, and the rest is headroom for the conversation. The 1M-token window dwarfs the document loads users actually drop into it, and most ephemeral sessions occupy under 30% of available capacity.

Read the constraint carefully. "Constrained knowledge base" and "limited and manageable size" are technical hedges, but they are also a precise description of the ephemeral document case this article opened with. A handful of PDFs is a constrained knowledge base. Three contract drafts are of limited and manageable size. The academic literature has caught up to what users have been doing for over a year, and given the pattern a label.

The naming matters because Week 5 currently has no vocabulary for what its own students are already doing with Claude Projects and NotebookLM. They drop a few PDFs into a chat interface, ask questions, and get answers. The mechanism is closer to CAG than to RAG, even though the curriculum does not describe it that way. Adopting the term, even just as a glossary entry, closes the gap between course vocabulary and student experience.

. . .

Privacy and Locality

The cost analysis above quietly assumes that both RAG and CAG run against a frontier-model API. For most consumer applications and many enterprise prototypes, that assumption holds. For the lawyer's contract drafts, it does not.

RAG has a property that long-context CAG cannot offer: the documents do not have to leave the user's machine. A small embedding model running locally, a SQLite-backed FAISS index in a temp directory, and a local generation model (or a remote call with a redacted payload) keeps the sensitive content inside the user's trust boundary. The questions are answered without the underlying PDFs ever crossing the network.

CAG over a 1M-token frontier window does not have this option. The documents must enter the cloud-hosted model's context on every session. The major providers all log API inputs for safety review and abuse detection by default; some retain inputs for training unless an enterprise contract explicitly opts out. For three contract drafts the night before a meeting, this is a privacy delta the cost-per-query analysis does not capture and cannot fix.⁷

This is the variable the decision framework most often gets wrong. The right pattern depends not only on corpus size and query volume, but on whether the documents can leave the room. When they cannot, RAG's locality advantage matters more than any token-economics ratio. The full-local RAG stack (sentence-transformers embedding, FAISS index, llama.cpp generation) has been a reliable engineering pattern for two years now, and for many enterprise legal, medical, and defense workloads it is the only option on the table.

. . .

Where Long Context Still Fails

The case for CAG is not that long context windows are a free lunch. They are not. The published failure modes are real, well-characterized, and important to teach alongside the cost crossover.

The first failure mode is the "lost in the middle" effect identified by Liu et al. and now familiar to anyone who has worked with long-context models for more than a few weeks. Performance is highest when relevant information sits at the beginning or end of the context. It degrades, often sharply, when the relevant information is buried in the middle, even for models explicitly marketed as long-context.⁸

The second failure mode is recall ceiling. Zilliz reports that Gemini 1.5 maintains nominal recall capability all the way out to 1M tokens, but average recall in practical use hovers around 60%. Six in ten relevant facts make it through to the model's working representation. This is not a number a user would tolerate from a search engine. It is the number we are quietly accepting from "stuff the documents into the window."⁹

The third failure mode is U-shaped quality decay. Databricks Mosaic Research benchmarked twenty models and found that most show a U-curve: accuracy first rises with longer context, then peaks, then falls. Specific thresholds: GPT-4 starts to decrease after 64,000 tokens. Llama-3.1 405B degrades after 32,000 tokens. Mixtral 8x7B starts dropping at just 4,000 tokens. The frontier models with the cleanest scaling curves up to 100,000 tokens (o1, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro) are a specific named set, not the whole field.¹⁰¹¹

The fourth failure mode is the one most likely to ambush an unprepared user. At long contexts, Claude 3 Sonnet's copyright refusal rate climbed from 3.7% at 16,000 tokens to 49.5% at 64,000 tokens. Half the queries refused outright. DBRX-instruct's instruction-following failure rate rose from 5.2% at 8,000 tokens to 50.4% at 32,000. The behavior is not gradual quality decay; it is a categorical failure that kicks in when the document load crosses a threshold no user is informed of.

The fifth failure mode bears on debuggability. When RAG retrieves the wrong chunks, the user can inspect the chunks. The provenance is right there. When CAG produces a wrong answer, the user has no chunk-level evidence to consult. The provenance metadata that Week 5 introduces (source, confidence, timestamp) becomes harder to attach when there are no discrete retrieval units to attach it to. This is solvable, but it is not solved by default.

The honest summary: long context is not a substitute for retrieval in every case. It is a substitute for retrieval in the case where the corpus is small enough to fit, the query session is short enough to amortize the load cost, and the user can tolerate a recall ceiling and the occasional refusal. That is, in fact, the ephemeral PDF case. For most other cases, the failure modes argue for keeping RAG in the toolbox.¹²

. . .

The Synthesis: Self-Route

Li and colleagues at Google DeepMind published the strongest synthesis at EMNLP 2024 in a paper titled "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach." Their headline finding is simple and worth quoting directly.¹³

"When resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage." That is the actual frontier: long context wins on quality, RAG wins on cost. The two patterns are not interchangeable; they trade against each other on the axis the user cares about most.

The paper's contribution beyond benchmarking is a hybrid mechanism called Self-Route. The idea is straightforward: for each query, run RAG first, then show the retrieved chunks to the model and ask whether they are sufficient to answer the question. If yes, use the RAG answer; if no, fall back to the full long-context call. The model decides per-query, by self-reflection, which pattern is appropriate.

The empirical result is the load-bearing line. With Self-Route, GPT-4o uses 61% of the tokens it would have used in pure long-context mode while matching long-context quality. Gemini 1.5 Pro uses 38.6%. The hybrid reaches near-LC quality at near-RAG cost for the substantial subset of queries where retrieval was already enough.

For the ephemeral document case, Self-Route reframes the choice. You do not have to commit to "always RAG" or "always CAG." You commit to a routing rule, and the model itself participates in routing each query. The amortization assumption is no longer load-bearing because the cost-per-query falls out of the routing distribution rather than the corpus size.

Self-Route turns the binary RAG-versus-long-context choice into a routing distribution: the model handles the queries it can with cheap retrieval and only escalates when retrieval is insufficient.

. . .

Good Enough, and the Long Climb to Great

One thing this article has so far understated is how cheap "good enough" RAG can be. A working retrieval pipeline does not require a managed vector database, a tuned chunking committee, or a re-ranker. It requires a few hundred lines of Python, a small embedding model that runs in seconds, and an in-memory index. For an ephemeral session, a good-enough RAG can be stood up in under a minute and thrown away with the documents. The "engineering investment never gets recovered" framing earlier in this article is true only when you treat RAG as a heavyweight production deployment. A laptop-grade RAG is something else entirely.

This is the same shape as traditional search. Elasticsearch returns useful results out of the box on most corpora, and Lucene-based retrieval has been doing this for over twenty years. What separates "useful" from "production-grade" is everything that comes after the first deployment: query rewriting, synonym expansion, faceting, freshness boosting, evaluation harnesses, and A/B infrastructure. The same arc applies to RAG. The first day's index works; the next six months are spent on the climb from good-enough to great.

The boundary between traditional search and RAG was always fuzzy and is increasingly absent in production. Modern Elasticsearch ships dense vector indexing and kNN search as native features, with hybrid scoring that combines keyword (BM25) and embedding similarity by default.¹⁴ Most enterprise RAG stacks are, structurally, search engines with embedding-based ranking added on top. The "RAG vs traditional search" framing flatters the novelty of RAG and obscures the continuity.

Two implications follow. First, an ephemeral RAG built on commodity tools (a SQLite index, a local embedding model, a quick BM25 fallback) is a perfectly reasonable choice for short-lived workloads where privacy or compliance rules out CAG. The cost argument earlier in this article applies to managed-cloud RAG specifically, not to RAG as a category.

Second, and harder: the climb from good-enough to great is real, and CAG does not have an analog. CAG either works for the question or it does not. RAG has a multi-year tradition of measurable iterative improvement, with shared evaluation harnesses (BEIR, MTEB) and decades of search-quality literature to inherit from. The amortization assumption is not just about per-query cost; it is also about whether the workload is worth iterating on. Sometimes it is. The decision framework should leave room for that.

. . .

A Decision Framework

The framing this article wants to leave the reader with is not "RAG is dead." It is "RAG is one of four patterns, and the choice depends on three variables, plus a fourth that overrides them." The three are corpus size, query volume, and document persistence. The fourth is content sensitivity, which when present forces locality regardless of the other three. The four patterns differ in which rung of the Week 5 escalation ladder they sit on, plus the two patterns that step off the ladder entirely.

Corpus	Queries	Persistence	Right Pattern
Large (10k+ docs)	High (1k+/day)	Stable	Vector RAG (full stack)
Bounded or moderate	Moderate	Stable	BM25-only RAG
Bounded (under window)	Low (under 50)	Ephemeral	CAG with prompt cache
Bounded	High	Stable	Self-Route hybrid
Large	Low	Either	BM25-only RAG (cost dominates)
Bounded	Variable	Ephemeral	BM25-only RAG or grep prefilter
Sensitive content	Any	Any	Local RAG (locality forces the pattern)

Corpus size and query volume set the cost regime, persistence sets the amortization horizon, and sensitivity vetoes cloud CAG. The Vector RAG case is the top rung of Week 5's ladder. Large corpus, persistent index, many users, many queries, every per-query cent matters, the vocabulary gap between queries and documents is wide enough that BM25 alone falls short. Build the full stack: embedding pipeline, ANN index, reranker, the whole climb.

The BM25-only RAG case is the rung beneath it and the case Week 5 now teaches as the default starting point. Bounded or moderate corpus, recurring access, no measured evidence that vector retrieval is needed yet. The pipeline is forty lines of Python: an Elasticsearch index, a match query, top-k results, an LLM call. No embedding pipeline to maintain. No second index. The amortization assumption still holds, but at a much lower fixed cost than the Vector RAG case, which means the query-volume threshold for breaking even is correspondingly lower.

The CAG case is the one that steps off the ladder entirely. Bounded corpus, ephemeral session, low query count, content the user is comfortable sending to the cloud. The amortization assumption fails even at the BM25 rung, the engineering investment never returns, and prompt caching makes long context a competitive choice in absolute dollars. Skip the index, drop the documents into the model's context, ask questions, and close the tab.

The Self-Route case is the synthesis. Bounded or moderate corpus, recurring access, mixed query types. The model itself decides which pattern is right for each question. Cost stays close to RAG's floor for the queries that retrieval handles. Quality stays close to long-context's ceiling for the queries that need full context.

For the boundary case (bounded corpus, ephemeral access, but with structure that grep can exploit) there is a deterministic textual prefilter feeding a small context, with no embedding model and no Elasticsearch in the loop at all.¹⁵ Worth knowing about; this is what sits beneath even the BM25 rung when the corpus is small enough that an inverted index is overkill.

The privacy override sits outside the cost-quality plane. When the documents cannot leave the room, the cost of "the wrong pattern" is not measured in cents per query but in compliance violations and lost client trust. Local RAG (at whichever rung the local stack supports) is the only answer that can be given honestly.

. . .

What Week 5 Still Gets Right

None of this argues that the Week 5 curriculum should be discarded or rewritten. The escalation-ladder framing the week already teaches (BM25 floor, then BM25 plus LLM, then vectors when measurement demands it, then hybrid plus reranker at the top) is the right structure. This article adds the question that sits underneath the ladder: does the workload justify being on the ladder at all? When the answer is no, the patterns are CAG, Self-Route, and grep-prefilter, all of which complement the ladder rather than replacing it.¹⁶

Three Week 5 lessons survive the reframing intact and become more important, not less, in the broader pattern landscape:

Retrieval-quality measurement still matters. Self-Route is a RAG layer underneath a long-context fallback; the precision and recall of that layer determine how often the model can route to the cheaper path. The closed BM25 evaluation loop the week teaches is doing real work in every pattern except pure CAG.
Provenance metadata matters more. When CAG is in play, the natural chunk-level provenance disappears. The four-field schema the week introduces (source, confidence, timestamp, agent_id) becomes the design problem the user must solve, not the design problem the system solves for them.
Production-perspective discipline transfers. Latency budgets, drift detection, and cost management all apply to CAG and Self-Route as much as to RAG at any rung. The instrumentation surfaces are different; the operational discipline is the same.

The vocabulary gap the article was originally written to fill is now mostly closed: Week 5 names BM25, the simplest-possible-RAG pattern, and the escalation ladder explicitly. What this article still adds is the amortization horizon as the variable that decides whether to step onto the ladder at all, plus the named alternatives (CAG, Self-Route, locality as a hard constraint) for when the horizon never closes.

. . .

Closing

RAG is not dead. It is also not the default. The default for any document workflow is the pattern whose amortization horizon matches the workload's lifetime, whose locality matches the data's sensitivity, and whose engineering investment matches how far up Week 5's escalation ladder the workload deserves to go. For a 50-million-document enterprise corpus queried by ten thousand users daily, that pattern is Vector RAG at the top of the ladder. For three contract drafts on a Tuesday afternoon, it is one of several other things, with the choice depending on whether those drafts can leave the room.

The vocabulary to describe the alternatives exists in peer-reviewed form, the pricing structures from at least two frontier providers are tuned to support them, and the product surfaces that knowledge workers actually use already implement them under the hood. The curriculum has caught up to the ladder. The decision this article still asks the reader to make is the one before the ladder: whether the workload in front of you has a query stream long enough to justify climbing any rung at all.¹⁷

Now that you have learned RAG and its escalation ladder, here is the question that sits underneath every rung: most of the time, you may not need to climb at all; some of the time, you do not get to choose; and all of the time, the work that separates good-enough from great is exactly the work the search community has been doing for two decades.

. . .

References

Annotations, contextual quotes, and grounding for each numbered reference in this article live on the companion sources page.

Redis. (2025-2026). "RAG vs Large Context Window: Real Trade-offs for AI Apps."
Atlas Workspace. (2025-2026). "NotebookLM vs Claude Projects: Side-by-Side Feature Comparison."
Rajendran, H. (2024-2025). "Skip the RAG Workflows with Gemini's 2M Context Window and the Context Cache." Google Cloud Community on Medium.
Anthropic. (Current). "Prompt Caching." Claude API Documentation.
MindStudio. (2025-2026). "What Is Flat-Rate Long-Context Pricing? How Anthropic Changed the Economics of RAG."
Chan, B. J., Chen, C., Cheng, J., & Huang, H. (2024). "Don't Do RAG: When Cache-Augmented Generation Is All You Need for Knowledge Tasks." arXiv preprint, accepted as short paper at WWW 2025.
Anthropic. (Current). "Privacy Policy." Anthropic legal documentation.
Liu, N. F., et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics.
Zilliz. (2024-2025). "Will RAG Be Killed by Long-Context LLMs?"
Databricks Mosaic Research. (2024). "Long Context RAG Performance of LLMs." Databricks Engineering Blog.
Leng, Q., et al. (2024). "Long Context RAG Performance of Large Language Models." arXiv preprint.
Yu, T., et al. (2024). "In Defense of RAG in the Era of Long-Context Language Models." arXiv preprint.
Li, Z., Li, C., Zhang, M., Mei, Q., & Bendersky, M. (2024). "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach." EMNLP 2024 Industry Track.
Elastic. (Current). "k-nearest neighbor (kNN) search." Elasticsearch Reference Documentation.
AkitaOnRails. (2026). "Is RAG Dead? Long Context, Grep, and the End of the Mandatory Vector DB."
LlamaIndex. (2024-2025). "Long Context RAG: New Architectures and Tradeoffs."
Yu, T., Xu, A., & Akkiraju, R. (2025). "Long Context vs. RAG for LLMs: An Evaluation and Revisits." arXiv preprint.

RAG Long Context Cache-Augmented Generation Prompt Caching Privacy and Locality Decision Frameworks