Sources

Grounding, citations, and further reading for The Amortization Assumption.

All of this is optional. These are the sources used to write the article, shown here as grounding for the research behind the argument. Nothing on this page is required reading.

The article is self-contained. This page exists so the work is properly cited, and so anyone who wants to go deeper on a specific claim knows where to look.

About the Sources

Chan et al.: Cache-Augmented Generation (WWW 2025)

Chan, Brian J., Chao-Ting Chen, Jui-Hung Cheng, and Hen-Hsen Huang. ACM Web Conference 2025 (short paper).

The paper that names the pattern this article is about. Argues that for "constrained knowledge bases of limited and manageable size," preloading the documents into the model context and caching the runtime parameters is a simpler, faster, and often more accurate substitute for retrieval. Available at arxiv.org/abs/2412.15605.

Li et al.: RAG or Long-Context LLMs? A Comprehensive Study and Hybrid Approach (EMNLP 2024)

Li, Zhuowan, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. EMNLP 2024 Industry Track.

Establishes the cost-quality frontier between RAG and long-context approaches and proposes Self-Route as a routing-based hybrid. The paper's headline result, that long context wins on quality and RAG wins on cost, is the central organizing finding for this article's three-pattern framework. Available at arxiv.org/abs/2407.16833.

Liu et al.: Lost in the Middle (TACL 2024)

Liu, Nelson F., et al. Transactions of the Association for Computational Linguistics, 2024.

The canonical citation for the long-context recall U-curve. Already referenced in Week 5's chunking-strategies reading description. Useful here as the empirical reason "drop the documents into the window" can fail even when the documents fit. Available at arxiv.org/abs/2307.03172.

Databricks Mosaic Research: long-context RAG benchmark

Leng et al., 2024. Databricks engineering blog and companion arXiv paper.

The most rigorous published benchmark of long-context RAG performance across twenty open and commercial models. The blog post provides accessible thresholds (where each model's quality begins to degrade); the arXiv paper provides the methodology and full numbers. Both are cited together throughout this article. Blog at databricks.com/blog/long-context-rag-performance-llms; paper at arxiv.org/abs/2411.03538.

Anthropic: Prompt Caching documentation

Claude API documentation, current as of 2026.

The pricing structure that makes ephemeral CAG economically competitive. The 1.25x cache-write multiplier and 0.1x cache-read multiplier on the 5-minute TTL together produce a break-even point of one cache read, which is structurally tuned to interactive document Q&A sessions. Documentation at platform.claude.com/docs/en/build-with-claude/prompt-caching.

Elastic: Elasticsearch kNN search reference

Elastic Documentation, current.

The official reference for Elasticsearch's dense vector indexing and kNN retrieval features. Cited in this article as evidence that the "RAG vs traditional search" boundary is largely artificial: the same engine the search community has been using for two decades now ships embedding-based ranking as a standard feature. Available at elastic.co/guide/en/elasticsearch/reference/current/knn-search.html.

The Pattern Week 5 Teaches

1Redis on RAG vs long-context cost ratios at enterprise volume ↩ Back to article

Redis published a clean side-by-side comparison: "RAG queries averaged a cost of $0.00008 per request, while full-context LLM queries averaged $0.10 per request, over 1,250 times more expensive." A well-designed RAG pipeline retrieves five to twenty chunks totaling two to ten thousand tokens; a long-context call sends the entire corpus on each invocation. The ratio is real and faithful to the workload that justified building RAG in the first place. It is also beside the point for the ephemeral case, where total spend in absolute dollars matters more than the ratio.

Redis Engineering Blog. Read the post

The Use Case That Breaks the Assumption

2Atlas Workspace on the Claude Projects hybrid behavior ↩ Back to article

Atlas Workspace's comparison piece captures the actual product behavior: "Claude uses a hybrid approach. When your project files are small enough, everything gets loaded directly into context, meaning it has your entire project in memory at once." Students arriving at Week 5 with the Claude Projects mental model are not wrong about how the tool works. The course is what is missing the vocabulary. The same source contrasts NotebookLM (strictly grounded retrieval) with Claude Projects (context-augmented generation that can also draw on training knowledge), a distinction Week 5's current vocabulary cannot make.

Atlas Workspace. Read the comparison

3Rajendran on skipping RAG with Gemini's context cache ↩ Back to article

Hemanand Rajendran's piece on the Google Cloud Community publication argues that "combining the long context window (2M tokens) with the Context caching feature, one can skip the lengthy process of building RAG pipelines in their applications." Important here because the same pattern (long window plus a context cache, no retrieval step) is documented for at least two of the three frontier providers. This is not a Claude-specific observation; it is the design pattern the major model APIs are converging on for ephemeral document workflows.

Rajendran, H. Google Cloud Community on Medium. Read the article

The Cost Math Has Quietly Inverted

4Anthropic's prompt-caching pricing structure ↩ Back to article

From Anthropic's documentation: "Cache write tokens at 1.25 times the base input tokens price for 5-minute caching, 2 times for 1-hour caching, and cache read tokens at 0.1 times the base input tokens price ... A cache hit costs 10% of the standard input price, meaning caching pays off after just one cache read for the 5-minute duration." Read that twice. The 5-minute TTL was effectively designed for ten-minute Q&A sessions over uploaded documents. Anthropic also reports up to 85% latency reductions for cached prompts, with a 100,000-token book example dropping from 11.5 seconds to 2.4 seconds.

Anthropic Claude API documentation. Read the docs

5MindStudio's flat-rate pricing crossover analysis ↩ Back to article

MindStudio's analysis is the cleanest published version of the crossover: "For a 200,000-token knowledge base, traditional RAG (5 chunks): ~$0.012 per query. Long-context: ~$0.60 per query. Long-context with caching: ~$0.06 per query." Without caching, long-context costs fifty times more per query than RAG. With caching, it costs five times more. At five queries per session that is a difference of about 24 cents, a rounding error compared to the engineering cost of building a managed-cloud RAG pipeline you will use once. The same piece names the conditions where long-context wins: documents fit, low-to-medium query volume, frequently changing content, full-document coherence required.

MindStudio. Read the analysis

Cache-Augmented Generation

6Chan et al.: the CAG paper at WWW 2025 ↩ Back to article

Chan, Chen, Cheng, and Huang frame CAG as "particularly suited for applications with a constrained knowledge base where documents are of a limited and manageable size." That phrase is technically about benchmark suites; it is, in practice, an exact description of "a handful of PDFs." The paper claims CAG "eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance." A peer-reviewed paper now exists with a name for the workflow Week 5 currently has no vocabulary for. Adopting the term, even just as a glossary entry, closes the gap between course vocabulary and student experience.

Chan et al., WWW 2025. Read the paper

Privacy and Locality

7Anthropic's privacy and data-handling policy ↩ Back to article

The published Anthropic privacy policy describes the conditions under which API inputs are logged, retained, and reviewed. The relevant practical fact for this article: API content is retained by default for safety and abuse review, and the boundaries on training use depend on which product surface and which contractual tier the customer is on. Citing Anthropic specifically because their published policy is the clearest of the three frontier providers; the same general pattern (default-on logging, opt-out training use under enterprise contract) holds for OpenAI and Google. For sensitive documents, this is the variable that overrides the cost analysis: cloud CAG is not a private channel, regardless of how cheap it has become.

Anthropic. Read the privacy policy

Where Long Context Still Fails

8Liu et al.: lost in the middle ↩ Back to article

Liu et al. write, in TACL: "Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models." The "even for explicitly long-context models" qualifier is the part that should land in a Week 5 reading. Marketed window size is not the same as effective window size, and a CAG session that exceeds the model's effective window will silently lose accuracy on questions whose answers are buried mid-document.

Liu et al., TACL 2024. Read the paper

9Zilliz on the 60% recall ceiling ↩ Back to article

Zilliz reports that "Gemini 1.5 maintains recall capabilities all the way to 1M tokens, but the average recall hovers around 60%. If you want to make sure the model is actually using the context you are sending it, you are best off curating it first." Zilliz is a vector-database vendor and the framing reflects that interest. The recall figure is striking nonetheless: even when the documents fit, the model uses about six in ten of the relevant facts on average. The article cites this as the empirical ceiling on "stuff the documents in" approaches, not as a vendor argument for vector databases.

Zilliz Engineering Blog. Read the post

10Databricks: the long-context U-curve, with model-specific thresholds ↩ Back to article

Databricks Mosaic Research benchmarked twenty models and found that most show a U-curve. Specific thresholds from the blog: "GPT-4-0125-preview starts to decrease after 64k tokens. Llama-3.1-405b performance starts to decrease after 32k tokens. Mixtral-8x7b degradation begins at 4k tokens." The frontier models that maintain accuracy up to 100k tokens are a specific named set: o1, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro. The same benchmark surfaced a non-obvious failure: at long contexts, Claude 3 Sonnet's copyright refusal rate climbed from 3.7% (16k) to 49.5% (64k), a categorical failure rather than gradual quality decay.

Databricks Engineering Blog. Read the post

11Leng et al.: the underlying long-context RAG benchmark paper ↩ Back to article

The arXiv paper underneath the Databricks blog. Provides the methodology and full numerical results behind the threshold claims, including the average retrieval recall plateau (0.788 at 8k tokens, 0.894 at 32k, 0.947 at 96k+) and the dataset-specific saturation behavior. Useful as the academic backing for the otherwise blog-only U-curve claim. Cited adjacent to ref 10 because the two are companion artifacts.

Leng et al., 2024. Read the paper

12Yu et al.: in defense of RAG, even with long-context models ↩ Back to article

Yu et al. argue against the prevailing trend by showing that "extremely long context in LLMs suffers from diminished focus on relevant information and leads to potential degradation in answer quality." They propose Order-Preserve RAG (OP-RAG), which retrieves and concatenates chunks in original document order rather than relevance-ranked order. The relevance to this article: even when the documents fit in a long-context window, retrieval-curated context can still beat full-corpus context on accuracy. The strongest published counter-argument to the "long context replaces RAG" framing.

Yu et al., 2024. Read the paper

The Synthesis: Self-Route

13Li et al.: Self-Route at EMNLP 2024 ↩ Back to article

From the abstract: "When resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage." The paper introduces Self-Route, a hybrid mechanism in which the model itself decides per-query whether the retrieved chunks are sufficient or whether to fall back to long-context. The empirical efficiency result is the load-bearing line: "GPT-4O uses only 61% tokens while achieving comparable performance with LC, and Gemini-1.5-Pro uses 38.6% of the tokens." The actual cost-quality frontier, with peer review.

Li et al., EMNLP 2024 Industry Track. Read the paper

Good Enough, and the Long Climb to Great

14Elastic on dense vector indexing and kNN search ↩ Back to article

Elastic's own reference documentation describes Elasticsearch's native dense-vector field type, kNN retrieval, and hybrid scoring (BM25 plus vector similarity). Cited here for a single load-bearing claim: the engine the search community has been using for over twenty years now ships embedding-based ranking as a built-in feature. The "RAG vs traditional search" boundary is increasingly artificial. Most enterprise RAG stacks are, structurally, search engines that have grown an embedding layer; the engineering disciplines that distinguish good-enough Elasticsearch from production-grade Elasticsearch (query rewriting, freshness boosting, evaluation harnesses) transfer almost entirely to RAG.

Elastic Documentation. Read the docs

A Decision Framework

15AkitaOnRails: grep + long-context as a fourth pattern ↩ Back to article

Akita argues that for many ephemeral document workloads, the right substitute for vector RAG is not long context but a deterministic textual prefilter (grep, ripgrep) feeding a small context. The point that matters for the decision framework: once you stop assuming RAG is the default, the design space opens up beyond the RAG / CAG / hybrid trichotomy. A grep-prefilter pattern with no embedding model in the loop is genuinely competitive when the corpus has structure that text matching can exploit.

AkitaOnRails. Read the post

What Week 5 Still Gets Right

16LlamaIndex on evolved RAG architectures ↩ Back to article

From LlamaIndex's piece: "While long-context LLMs will simplify certain parts of the RAG pipeline (e.g. chunking), there will need to be evolved RAG architectures to handle the new use cases that long-context LLMs bring along." LlamaIndex is a RAG-framework vendor and the framing reflects that interest. The argument is still useful: long context kills the chunking step but not the retrieval step, because in production the corpus is usually still bigger than any window. The article's three-pattern reframing is consistent with this view; it just names the bounded-corpus subcase explicitly.

LlamaIndex Engineering Blog. Read the post

Closing

17Yu et al.: long context vs RAG, evaluated and revisited ↩ Back to article

Yu, Xu, and Akkiraju revisit the long-context vs RAG question on more current benchmarks. Their framing closes the loop on the academic literature: "Long Context generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. RAG has advantages in dialogue-based and general question queries." The carve-out is the part Week 5 should hear: chunk-based RAG (the variant Week 5 currently teaches) is the worst-performing of the three retrieval forms. The vocabulary to describe the alternative now exists in peer-reviewed form.

Yu et al., 2025. Read the paper