
Context Rot

A model with a one-million-token window does not actually use one million tokens. As input grows, performance degrades in ways that are non-uniform, mechanism-bound, and visible across every frontier model tested. The promise of "just give it everything" has been measured, and it does not survive the measurement.

The phrase context rot entered industry vocabulary in July 2025, when Chroma published a technical report showing that frontier large language models do not use their context windows uniformly. The term is now used loosely to describe any failure mode that gets worse as input length grows. That looseness is a problem. Context rot is not one phenomenon. It is at least five distinct failure modes, each with its own mechanism, its own benchmark literature, and its own implications for how production systems should be built.

This article unpacks those failure modes against the academic record. The goal is not to argue that long-context models are useless. They are not. The goal is to make the gap between marketed capacity and effective capacity legible, so that engineers stop assuming that more tokens in means more information used.

. . .

The Chroma Study

The foundational paper is Context Rot: How Increasing Input Tokens Impacts LLM Performance, published as a Chroma technical report on July 14, 2025, by Kelly Hong, Anton Troynikov, and Jeff Huber. The study evaluates eighteen frontier large language models across a battery of controlled tasks. The evaluated set spans every major lab.

The experimental discipline is what makes the results credible. Every experiment holds task complexity constant and varies only input length. Temperature is set to zero where supported. Outputs are scored by a GPT-4.1 judge calibrated to better than ninety-nine percent agreement with human raters. The point of the design is to isolate length itself as the causal variable, so that any degradation observed cannot be attributed to harder questions, more ambiguous evidence, or different prompting.

The four headline findings:

  1. Every model degrades as input length grows, and the degradation curve is non-monotonic. Performance does not slide gracefully; it bounces.
  2. Distractors amplify with length. A single semantically similar distractor causes measurable accuracy loss. Four distractors are not four times worse than one; the curve is non-linear and model-specific.
  3. Haystack structure matters in a counterintuitive direction. Models perform better on shuffled haystacks than on logically structured ones, suggesting that global coherence misleads the model in ways that incoherent noise does not.
  4. Position sensitivity persists at every length. Targets placed early in the input are recalled at significantly higher accuracy than identical targets placed late.

The Chroma study uses three task families. The first is an extended Needle-in-a-Haystack at eight input lengths and eleven needle positions, with distractor configurations of zero, one, and four. The second is the LongMemEval conversational benchmark at roughly one hundred thirteen thousand tokens for the full prompt versus three hundred tokens for the focused version, run across three hundred and six conversational question-answering items. The third is a "repeated words" replication task where the model is asked to reproduce a sequence of repeated words with a single unique token inserted at varying positions, tested at word counts beyond ten thousand.
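The needle-in-a-haystack grid described above can be sketched in a few lines. This is a minimal illustration, not Chroma's harness: the function names are hypothetical, length is approximated in whitespace-separated words rather than tokens, and distractors are left out.

```python
def depth_fractions(n_positions: int = 11) -> list[float]:
    """Evenly spaced needle depths from 0.0 (start) to 1.0 (end)."""
    if n_positions < 2:
        raise ValueError("need at least two positions")
    return [i / (n_positions - 1) for i in range(n_positions)]


def insert_needle(haystack_words: list[str], needle: str, depth: float) -> str:
    """Insert the needle sentence at a fractional depth of the haystack."""
    cut = round(depth * len(haystack_words))
    words = haystack_words[:cut] + needle.split() + haystack_words[cut:]
    return " ".join(words)


def build_grid(filler: str, needle: str, lengths: list[int], n_positions: int = 11):
    """Yield (length, depth, prompt) for every length/depth combination."""
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler text must not be empty")
    for length in lengths:
        # Repeat the filler until it covers the requested length in words.
        reps = length // len(filler_words) + 1
        haystack = (filler_words * reps)[:length]
        for depth in depth_fractions(n_positions):
            yield length, depth, insert_needle(haystack, needle, depth)


# Illustrative values only: two lengths, eleven depths each.
grid = list(build_grid(
    filler="lorem ipsum dolor sit amet",
    needle="The secret code is 7421.",
    lengths=[100, 1000],
))
```

With eight lengths and eleven depths, the same loop would produce eighty-eight prompts per distractor configuration. Holding the needle and question fixed while only `lengths` changes is what lets length be read as the variable of interest.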

On LongMemEval, the gap between focused-prompt accuracy and full-prompt accuracy is largest for Claude Opus 4 and Sonnet 4. That is the diagnostic shape. It is not that the model cannot find the fact. The model is degraded by surrounding tokens it should be able to ignore. On the repeated-words task, accuracy is highest when the unique word sits near the front of the sequence, and the gap widens as total length grows.

. . .

The Marketed Million Tokens

[Figure: context window growth, from 512 tokens to 16M tokens, 2017 to 2026.]

Frontier models now advertise context windows of one million tokens. Google's Gemini 1.5 technical report publishes a recall curve that stays at or near one hundred percent from one thousand tokens out to one million, dipping only to 99.2 percent at ten million. OpenAI's GPT-4.1 announcement reports analogous behavior on its internal Needle in a Haystack evaluation. Those charts are not wrong. They answer a specific question, which is whether the model can find a templated phrase like "The special magic city number is" inserted into a synthetic essay. The answer is yes, at one million tokens.

The question is whether that finding generalizes. Independent benchmarks measure tasks that are only slightly harder, and the curves look very different. On Chroma's variant of Needle in a Haystack, which depth-averages across eleven needle positions and holds task complexity constant, GPT-4.1 drops from one hundred percent at five thousand tokens to nine percent at fifty thousand, then partially recovers to forty-five percent at nine hundred thousand. On NoLiMa, which forces associative reasoning by removing the lexical overlap that makes vanilla NIAH easy, GPT-4o falls from 99.3 percent at base length to 69.7 percent at thirty-two thousand. Claude 3.5 Sonnet falls from 87.6 percent to 29.8 percent over the same range.

[Figure: Long-context retrieval accuracy, 1K to 1M input tokens. Three benchmarks on three frontier models; the marketed line is the easiest possible test. Y-axis: accuracy, 0% to 100%. X-axis: input length (tokens, log scale). Series: Gemini 1.5 Pro, vanilla NIAH (Google tech report); GPT-4.1, Chroma harder NIAH (depth-averaged, n=11); GPT-4o, NoLiMa associative NIAH (32K cap).]
The three lines do not measure the same thing. The marketed million-token number describes the easiest possible long-context task; everything else sits below it.

The shape of these three curves carries the argument. The vendor line is flat. The harder NIAH bounces. The associative NIAH slides. All three are frontier models from labs with publicly stated long-context capabilities. The marketed million-token figure describes the easiest possible test. Any real workload (answer extraction over messy haystacks, multi-step reasoning, semantic association) lives on a curve that bounces or slides well below the marketed line.

Three caveats belong with this chart. First, the three lines use different models, because that is where the published full-resolution data lives. Google publishes the cleanest single-model curve out to one million tokens; Chroma's public CSV ships GPT-4.1; NoLiMa's headline model is GPT-4o. Second, Chroma's GPT-4.1 line is computed from a CSV with eleven samples per length point, so the sharpness of any single dip is partly small-sample noise; the qualitative bounce is consistent with their multi-model figures at higher N. Third, NoLiMa caps its public release at thirty-two thousand tokens. The blue line ends there because the benchmark ends there, not because the model recovers above it.

A 2025 EMNLP finding sharpens this further. Du, Tian, Ronanki and colleagues, in Context Length Alone Hurts LLM Performance Despite Perfect Retrieval, show that across five open- and closed-source models on math, question-answering, and coding tasks, performance degrades by 13.9 to 85 percent as input length grows even when the model can perfectly recite all relevant evidence and even when distractor tokens are replaced with whitespace. This rules out "the model could not find the fact" and "distractor density" as the sole causes. Length itself is causal, independent of content.

The gap between the green line and the other two is the gap between marketing and engineering reality. RULER's 2024 result names a similar number from a different angle: of seventeen long-context language models tested, only about half maintained satisfactory performance at thirty-two thousand tokens, despite every model advertising support for that length or more. BABILong's 2024 NeurIPS finding closes the loop: across twenty reasoning tasks, popular LLMs effectively use only ten to twenty percent of their context window.

. . .

Lost in the Middle

Chroma's position-sensitivity finding is not new. It is the most recent measurement of a phenomenon Stanford and Berkeley researchers documented two years earlier. Nelson Liu and colleagues published Lost in the Middle: How Language Models Use Long Contexts on arXiv in July 2023, and in the Transactions of the Association for Computational Linguistics in 2024. The paper runs two tasks, multi-document question answering and key-value retrieval, over varying positions of the relevant evidence.

The result is a U-shaped curve. Performance is highest when the answer sits at the beginning or the end of the input, and it drops sharply when the answer lies in the middle. This pattern holds even for models marketed as explicitly long-context, and it holds across the open and closed models tested. The implication is that "the model has the information" and "the model uses the information" are different statements. The first can be true while the second is false.

For practitioners, the operational consequence is unambiguous. When you assemble a retrieved context for the model, the most critical evidence belongs at the start or at the end, not in the middle. The middle of the window is the part the model most reliably under-uses. Almost two years after the original publication, RAG pipelines still routinely assemble chunks in chronological order, which often leaves the highest-priority chunks somewhere in the middle. That is a design choice that contradicts the published evidence.
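One way to act on the U-shaped curve is to reorder chunks after ranking so the strongest evidence sits at the two edges. The sketch below assumes the chunks arrive sorted most important first; `place_for_u_curve` is a hypothetical helper name, not an API from any of the cited papers.

```python
def place_for_u_curve(chunks_by_priority: list[str]) -> list[str]:
    """Reorder chunks so the highest-priority ones sit at the edges.

    Input is sorted most important first. Even ranks fill the front,
    odd ranks fill the back, so the least important chunks land in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, chunk in enumerate(chunks_by_priority):
        (front if rank % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


ordered = place_for_u_curve(["a", "b", "c", "d", "e"])
```

For the five-chunk input, the best chunk `a` opens the context, the second-best `b` closes it, and the weakest chunk `e` sits in the middle. Whether this lifts accuracy on a given workload is an empirical question that the reordering itself does not answer.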

. . .

Why the Curve Bounces Instead of Sliding

A smoothly monotonic decline would suggest a single dominant cause that scales with input length. The literature does not name one. It names at least six mechanisms that compound, and three of them are structurally non-monotonic. This is why the red line in the chart above bounces instead of sliding.

Oscillatory RoPE Math

Rotary position encoding is the dominant position-encoding scheme in modern transformers, used in Llama, Qwen, Mistral, and many GPT-class models. RoPE encodes relative position by rotating query and key vectors in two-dimensional subspaces by angles proportional to position. The dot product of two rotated vectors is a sum of cosines at multiple frequencies, and a sum of cosines does not decay smoothly with distance. It oscillates.

Two recent papers document this directly. Dai, Shan, Song, and Liang's 2025 paper on Hyperbolic Rotary Positional Encoding states that "Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling," and proposes replacing the SO(2) rotations with Lorentz boosts in hyperbolic geometry specifically to enforce monotonic decay. A separate 2024 paper by Chen, Lv, Luan, Wang, and Liu observes that RoPE-trained models actually learn a U-shaped global attention pattern, not a monotonically decaying one, and identifies that learned U-shape as the key factor limiting RoPE's expressiveness and extrapolation.

This is the structural reason the curve bounces. At certain relative distances the attention coefficients ride a peak; at other distances they sit in a trough. Average accuracy across needle depths inherits the oscillation.
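The sum-of-cosines claim can be checked numerically. The sketch below is a toy, assuming the common RoPE frequency schedule with base 10000, an eight-dimensional head, and identical all-ones query and key vectors; none of these values come from the cited papers.

```python
import math


def rope_frequencies(dim: int, base: float = 10000.0) -> list[float]:
    """One rotation frequency per two-dimensional subspace."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def rotate(vec: list[float], position: int, freqs: list[float]) -> list[float]:
    """Rotate each (x, y) pair of vec by the angle position * theta."""
    out: list[float] = []
    for i, theta in enumerate(freqs):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_score(q, k, q_pos: int, k_pos: int, freqs: list[float]) -> float:
    """Dot product of the rotated query and the rotated key."""
    rq, rk = rotate(q, q_pos, freqs), rotate(k, k_pos, freqs)
    return sum(a * b for a, b in zip(rq, rk))


freqs = rope_frequencies(8)
q = [1.0] * 8  # toy vector; each pair has squared norm 2

# Score as a function of relative distance only (key fixed at position 0).
scores = [rope_score(q, q, d, 0, freqs) for d in range(64)]

# Closed form for identical q and k: sum over subspaces of 2 * cos(d * theta).
closed_form = [sum(2 * math.cos(d * t) for t in freqs) for d in range(64)]
```

Two things fall out of this toy. The score depends only on the offset between the two positions, and it rises again after falling: at distance 6 it is higher than at distance 3, because the fastest frequency has swung back toward a peak. Real heads have learned, non-identical queries and keys, so the actual pattern is more complicated than this sketch.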

Undertrained Distant Positions

Even where the RoPE math would behave at a given distance, the model has to have learned good query and key weights for that distance during training. It generally has not. An, Zhang, Zhong, and colleagues' 2024 paper, Why Does the Effective Context Length of LLMs Fall Short?, documents a left-skewed frequency distribution of relative positions seen during pretraining and post-training. Long relative distances appear orders of magnitude less often than short ones. The paper attributes the effective-context-length shortfall (less than half of trained length on average) to this distribution alone. Their proposed fix, ShifTed Rotary position embedding (STRING), gains roughly ten RULER points on Llama 3.1 70B and Qwen2 72B without retraining by remapping inference-time positions onto well-trained ones.

This produces a position-specific training-data hole rather than a smooth decay. A needle whose depth lands in a well-trained bucket is recalled. The same needle a few thousand tokens later lands in an undertrained bucket and is lost. Aggregated across depths, the per-length mean bounces.
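Part of that skew is pure combinatorics and can be counted directly. The sketch below counts how often each relative distance occurs inside one causally masked training sequence. The sequence length of 512 is an arbitrary choice for illustration, and the count ignores corpus effects such as short documents, which the paper's measured distribution also reflects.

```python
from collections import Counter


def relative_distance_counts(seq_len: int) -> Counter:
    """Count how often each relative distance occurs under causal attention."""
    counts: Counter = Counter()
    for query_pos in range(seq_len):
        # A causal query can attend to itself and every earlier position.
        for key_pos in range(query_pos + 1):
            counts[query_pos - key_pos] += 1
    return counts


counts = relative_distance_counts(512)

# Distance d appears seq_len - d times, so the longest distance appears once.
shortest, longest = counts[1], counts[511]
```

In a single full-length sequence, distance 1 is seen 511 times and distance 511 exactly once. The imbalance exists before any data is involved, and documents shorter than the window make it worse.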

Softmax Attention Sinks

Guangxuan Xiao and colleagues at MIT published Efficient Streaming Language Models with Attention Sinks at ICLR 2024. The paper documents that the first few tokens in any input absorb a disproportionate share of attention mass, regardless of their semantic relevance. The mechanism is softmax. Attention weights are forced to sum to one across visible tokens, so when later tokens have nothing strongly relevant to attend to, attention is dumped on the first few visible tokens. Initial tokens are visible from every subsequent position because of autoregressive structure, and they appear at the start of every training sequence, so the model learns to treat them as a default attention basin.

A follow-up paper by Gu and colleagues in late 2024 establishes that this is a softmax artifact, not a universal property of attention. Sink emergence is a function of optimization, loss surface, and data distribution. Replacing softmax with sigmoid attention removes the sink up to roughly one billion parameters.
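The normalization argument can be seen with toy numbers. The sketch below compares softmax weights with unnormalized elementwise sigmoid weights on made-up attention scores; it is an illustration of the sum-to-one constraint, not the formulation used in either paper.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax; the weights always sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_weights(scores: list[float]) -> list[float]:
    """Elementwise sigmoid; each weight is independent of the others."""
    return [1 / (1 + math.exp(-s)) for s in scores]


# Toy scores: every key is a poor match for the query.
irrelevant = [-6.0] * 8

# Same situation, but the first token is slightly less bad than the rest.
with_sink = [-2.0] + [-6.0] * 7

softmax_mass = sum(softmax(irrelevant))
sigmoid_mass = sum(sigmoid_weights(irrelevant))
first_token_share = softmax(with_sink)[0]
```

Under softmax the eight poor matches still share a full unit of attention, while the sigmoid weights stay close to zero. Once one token is slightly less bad than the others, softmax hands it most of the mass, which is the basin behavior described above.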

Sink dominance interacts with length non-linearly. At very short inputs the sink is small relative to the body. At moderate inputs it steals decisive attention from the body. At very long inputs the model often falls back to local-window behavior, which can be more reliable than the sink-dominated middle regime. This partly explains why some curves recover at the longest lengths instead of continuing to fall.

Prompt-Conditioned Recall Variance

Daniel Machlab and Rick Battle of VMware published LLM In-Context Recall is Prompt Dependent in April 2024. Their finding is that recall accuracy varies dramatically with how the question is phrased, even when the target fact is sitting in the context window. Memory in a language model is not a stable store. It is a re-derivation conditioned on every surrounding token. Change the prompt, and the model's apparent memory changes with it.

Empirically this means that even a clean controlled experiment with the same needle, same model, same length, and eleven depth positions sees variance across runs because the surrounding haystack differs. Part of the apparent bounce in any single low-N curve is statistical wobble. Most of the bounce in Chroma's higher-N multi-model figures is real signal, but the sharpness of any single dip in the public GPT-4.1 CSV (eleven samples per length) inherits some prompt-conditioned noise.
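A rough sense of how much wobble eleven samples allow comes from the standard error of a proportion. The calculation below assumes each sample is an independent pass/fail trial, which is a simplification the source does not state, and uses the forty-five percent figure quoted earlier purely as an example.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of an observed proportion p over n independent trials."""
    return math.sqrt(p * (1 - p) / n)


se_small = proportion_standard_error(0.45, 11)    # eleven samples per length
se_large = proportion_standard_error(0.45, 1100)  # hypothetical larger run
```

At eleven samples the standard error is about fifteen percentage points, so a dip or recovery of that size at a single length point is within what sampling noise alone could produce. At 1,100 samples it shrinks to about 1.5 points.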

Decoding-Mode Transitions

The last mechanism is most visible on copy and repetition tasks. The Chroma report documents that GPT-4.1 begins inserting lowercase "san" inside sequences like "San Francisco San Francisco san Francisco" only at certain repetition counts. This is a decoding-mode transition. At some lengths the model maintains the copy loop; at other lengths it falls into a degraded attractor. The transition points are model-specific and produce sharp discontinuities in the curve.

This mechanism interacts with the position-interpolation literature. Many production models extend context through inference-time scaling rather than retraining, using techniques like position interpolation, NTK-aware scaling, or YaRN. At lengths near a scaling-transition boundary, behavior can change qualitatively in ways that look like a discontinuity on the chart.

Decomposition

The six mechanisms decompose cleanly by the shape of failure each one produces:

Six mechanisms behind context rot, decomposed by failure shape. A smoothly declining curve would imply one cause. Three of these are structurally non-monotonic.

| Mechanism | Causes overall decline? | Causes the curve to bounce? |
| --- | --- | --- |
| Oscillatory RoPE math | YES | YES |
| Learned U-shape attention | YES | YES |
| Undertrained distant positions | YES | YES |
| Softmax attention sink | YES | PARTLY |
| Prompt-conditioned recall variance | NO | PARTLY |
| Decoding-mode transitions | NO | YES |

Sources: Dai et al. 2025; Chen et al. 2024; An et al. 2024; Xiao et al. 2024; Gu et al. 2024; Machlab and Battle 2024; Hong et al. 2025.
Three mechanisms drive bounce independently. The aggregate of all six is the curve.

A monotonic curve would imply a single dominant cause that scales smoothly with input length. There is no such cause. At least three of these six (oscillatory RoPE math, learned U-shape attention, undertrained distant positions) are structurally non-monotonic. The aggregate of all six is the bounce.

. . .

Beyond Literal Matching

A subtler failure mode is the model's reliance on lexical overlap between the question and the answer. Most Needle-in-a-Haystack-style benchmarks inadvertently reward that overlap. If the question asks "what color is the car?" and the needle says "the car is blue," the model can route attention through the shared surface form "the car."

Modarressi and colleagues at Adobe Research designed NoLiMa (No Literal Matching) to remove that shortcut. The benchmark, accepted at ICML 2025, rebuilds the haystack so the model must rely on latent semantic association to locate the needle. The result is dramatic. Across twelve state-of-the-art models claiming at least one hundred twenty-eight thousand tokens of context (including GPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B), performance declines noticeably starting at two thousand tokens. At thirty-two thousand tokens, ten of the twelve models fall below half of their short-context performance. Even GPT-4o falls from a near-perfect baseline of 99.3 percent to 69.7 percent.
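The lexical shortcut can be made concrete with a crude overlap score. The sketch below computes Jaccard overlap on content words between a question and a needle. The stop-word list, the function names, and the second question-needle pair are illustrative assumptions written in the spirit of the benchmark, not taken from its data.

```python
import re

# Small hand-picked stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "what", "which", "has", "been", "to", "of", "in"}


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stop words removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def lexical_overlap(question: str, needle: str) -> float:
    """Jaccard overlap between the content words of question and needle."""
    q, n = content_tokens(question), content_tokens(needle)
    union = q | n
    return len(q & n) / len(union) if union else 0.0


# Vanilla NIAH style: the question reuses the needle's surface form.
literal = lexical_overlap("What color is the car?", "The car is blue.")

# Associative style: answering requires knowing where the opera house is.
associative = lexical_overlap(
    "Which character has been to Dresden?",
    "Actually, Yuki lives next to the Semper Opera House.",
)
```

The first pair shares the content word "car"; the second shares nothing, so any route from question to needle has to go through world knowledge rather than string matching.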

What this sharpens is the diagnosis. The "rot" in context rot is not a retrieval miss. The model is finding the right region of the input. The failure is in associative reasoning across that region, and the failure scales with how much non-literal inference the question demands.

. . .

Reasoning Degrades Independently of Distractors

One natural counter-argument is that bigger windows hurt only when they contain irrelevant noise. Give the model exactly the right information and length should not matter. Mosh Levy, Alon Jacoby, and Yoav Goldberg of Bar-Ilan University tested this directly at ACL 2024. Their paper introduces FLenQA, a controlled reasoning benchmark in which the same reasoning sample is padded to multiple lengths with different padding types and positions.

The findings are sharper than expected:

  1. Reasoning performance degrades long before the technical context limit is reached. There is no cliff at thirty-two thousand or one hundred twenty-eight thousand tokens. The slide begins at a few thousand.
  2. Degradation appears even when the padding consists of duplicated relevant text. Length itself, not just irrelevant content, drives the failure.
  3. The worst case is when the model must integrate evidence from two non-adjacent locations inside a long input. Multi-source reasoning at distance is the failure mode that compounds fastest.

This forecloses the optimistic counter-argument. The problem is not just that long contexts contain distractors. The problem is that long contexts themselves degrade the reasoning process, regardless of what fills them.
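The padding design can be sketched as follows. This is a hypothetical reconstruction of the idea, not FLenQA's code: the function name, the word-based length measure, and the sample facts are all assumptions.

```python
def pad_to_length(
    facts: list[str],
    question: str,
    filler: str,
    target_words: int,
    position: str = "between",
) -> str:
    """Pad a two-fact reasoning sample to roughly target_words words."""
    used = sum(len(f.split()) for f in facts) + len(question.split())
    need = max(0, target_words - used)
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler text must not be empty")
    # Repeat the filler and trim it to exactly the number of words needed.
    pad_text = " ".join((filler_words * (need // len(filler_words) + 1))[:need])

    if position == "before":
        parts = [pad_text, *facts]
    elif position == "after":
        parts = [*facts, pad_text]
    elif position == "between":
        # Separating the facts forces reasoning across a distance.
        parts = [facts[0], pad_text, *facts[1:]]
    else:
        raise ValueError(f"unknown position: {position}")
    return " ".join([p for p in parts if p] + [question])


facts = ["Ana is taller than Ben.", "Ben is taller than Cy."]
question = "Is Ana taller than Cy?"
padded = pad_to_length(facts, question, "lorem ipsum dolor", 50)
```

Passing the facts themselves as `filler` gives the duplicated-relevant-text condition, and `position="between"` gives the non-adjacent case from the third finding. The reasoning content is identical in every variant; only the length and layout change.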

. . .

Multi-Hop and Structural Reasoning

Two further benchmarks confirm that the failure compounds when the reasoning task is more than retrieval. Google DeepMind's Michelangelo benchmark, by Vodrahalli and colleagues in late 2024, introduces a framework called Latent Structure Queries. The three tasks (Latent List, Multi-round Co-reference Resolution, and "I Don't Know") are designed to require the model to chisel away irrelevant context to reveal a latent structure. Across ten frontier models tested up to one million tokens of context, every model exhibits a sharp drop as reasoning complexity grows, even when raw retrieval is easy.

Yuri Kuratov and colleagues published BABILong at NeurIPS 2024. The benchmark embeds twenty reasoning tasks (fact chaining, induction, deduction, counting, list and set handling) inside long natural text up to ten million tokens. The headline result is that popular LLMs effectively utilize only ten to twenty percent of their context window, with performance falling rapidly as reasoning complexity rises. A vanilla retrieval-augmented baseline holds steady at roughly sixty percent on single-fact question answering regardless of haystack length, which strongly supports the operational claim that "appropriate context" beats "maximum context."

. . .

Long-Term Agentic Memory

Conversational and agentic systems hit context rot harder than single-shot QA. Wu and colleagues published LongMemEval at ICLR 2025. The benchmark tests five long-term memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. It comprises five hundred carefully curated questions embedded in extensible user-assistant chat histories, which run to roughly one hundred fifteen thousand tokens in the small variant and up to one and a half million tokens in the medium variant.

The results are bracing. Long-context language models show thirty to sixty percent accuracy drops on the small variant compared to focused prompts. State-of-the-art commercial systems like GPT-4o only achieve thirty to seventy percent accuracy in a setting that is substantially simpler than the small variant. The failure modes cluster around temporal reasoning and knowledge update tasks. The model cannot reliably distinguish what the user said an hour ago from what the user said three sessions ago, and it cannot reliably update an inferred fact when the user explicitly corrects it.

This last point matters for the practical claim made in the recent infrastructure-vendor discourse on agent memory. When a video producer says that an agent "may treat stale and current as equal," LongMemEval is the benchmark behind that claim. There is no per-token authority field in a transformer. There is no internal mechanism that says "this token is from the user, that token is from the model's inference, this token is from a tool call." Everything in the window is just tokens, weighted by attention.
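Since the model has no authority field, one option is to carry that metadata in application code and filter before anything reaches the window. The sketch below is one possible shape for such a record. The class, the field names, the authority scale, and the thirty-day cutoff are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextItem:
    text: str
    source: str        # e.g. "user", "tool", "retrieval", "model_inference"
    authority: int     # higher wins; the scale is an application decision
    observed_at: datetime


def select_current(items: list[ContextItem], now: datetime, max_age: timedelta) -> list[ContextItem]:
    """Drop stale items, then order by authority and recency, highest first."""
    fresh = [i for i in items if now - i.observed_at <= max_age]
    return sorted(fresh, key=lambda i: (i.authority, i.observed_at), reverse=True)


def render(items: list[ContextItem]) -> str:
    """Serialize items with visible provenance labels for the prompt."""
    return "\n".join(f"[{i.source} | {i.observed_at:%Y-%m-%d}] {i.text}" for i in items)


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
items = [
    ContextItem("User's office is in Berlin.", "user", 3,
                datetime(2025, 12, 1, tzinfo=timezone.utc)),
    ContextItem("User corrected: office moved to Lisbon.", "user", 3,
                datetime(2026, 1, 30, tzinfo=timezone.utc)),
    ContextItem("User probably prefers morning meetings.", "model_inference", 1,
                datetime(2026, 1, 30, tzinfo=timezone.utc)),
]
selected = select_current(items, now, max_age=timedelta(days=30))
prompt_block = render(selected)
```

The labels in `render` are only text, and the model may still ignore them. The enforcement happens in `select_current`, before the tokens exist. A pure age cutoff is a blunt rule that also discards old facts that are still true; a real system would more likely supersede items by key.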

. . .

What Does Help

The literature does not say long contexts are useless. It says the operational discipline has to change. Three things consistently improve outcomes:

| Tactic | What it does | Evidence |
| --- | --- | --- |
| Place critical evidence at the start or end of the assembled context | Mitigates the Lost in the Middle effect by routing the model's strongest attention to the most important content | Liu et al. (2024); Chroma (2025) |
| Retrieve a small, curated bundle rather than dumping documents | Avoids the non-linear distractor penalty and limits the reasoning-degradation effect from sheer length | BABILong (2024); FLenQA (2024); LaRA (2025) |
| Treat the context window as engineered state, not a dump zone | Forces explicit decisions about authority, freshness, and source attribution that the model cannot make on its own | Anthropic (2025); LongMemEval (2025) |
Three operational responses to context rot that the literature supports.

Anthropic has framed this discipline as context engineering, distinguished from prompt engineering. The argument, in their own writing, is that prompt engineering optimizes how you communicate a single instruction, while context engineering optimizes what tokens are in the window at all. The shift in vocabulary is small. The shift in practice is large. Context engineering treats the entire window (system instructions, tool definitions, retrieved chunks, conversation history, and the current user turn) as the unit of design.

The LaRA benchmark, presented at ICML 2025, gives the cleanest empirical handle on when to prefer retrieval-augmented generation over long context. Across twelve question-answering datasets, RAG and long-context approaches produced identical answers on roughly sixty percent of items. Long-context wins on whole-document reasoning tasks. RAG wins on retrieval-style queries. Accuracy drops by ten to twenty points or more when the relevant information sits in the middle of a long context. The headline conclusion is that there is no silver bullet. The two approaches are complementary, and the choice depends on the bundle of information the agent needs to do its job.

. . .

What This Means for Builders

The video and industry discourse around context rot tends to converge on the line "appropriate context, not maximum context." The literature behind that line supports it, but the line is too compact to be operationally useful. Here is a longer version with the receipts attached.

  1. Design the bundle before the database. Before choosing a vector store, a graph, or any other retrieval primitive, write down the exact set of fields your agent needs to do one specific task. The Chroma findings show that filling the window with semantically adjacent content actively hurts. Less is usually more, but only if "less" is the right less.
  2. Place the bundle, do not dump it. Liu's U-curve is two years old and still ignored by most production RAG pipelines. The highest-priority chunks belong at the front or the back of the assembled context, not in the middle. This is a one-line code change with a measurable accuracy lift.
  3. Stop trusting the marketed context length. RULER, NoLiMa, and BABILong all show that effective context is somewhere between ten and fifty percent of marketed context, depending on task. Build with the effective number, not the marketed number. If you do not know your model's effective number for your task, the first useful experiment is to measure it.
  4. Distinguish memory from inference in your own data, because the model cannot. A token your agent retrieved from a database, a token the model inferred during reasoning, and a token the user supplied directly all look identical to the next attention layer. If authority, freshness, or permissions matter to your task, you have to enforce them outside the model. The model will not enforce them for you.
  5. Measure degradation against your task, not against benchmarks. Every published benchmark is a synthetic proxy for what your agent actually does. The benchmarks establish that the phenomenon exists; they do not establish how it manifests in your domain. The practical move is to build a small evaluation set drawn from your real traffic and run it at multiple context lengths.
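
The measurement in the last item can start very small. The sketch below runs one evaluation set at several context lengths and reports accuracy per length. Everything in it is a stand-in: `ask_model` would be your own model client, `build_prompt` your own context assembler, and the toy versions here exist only so the loop runs end to end.

```python
from typing import Callable


def accuracy_by_length(
    cases: list[dict],
    lengths: list[int],
    build_prompt: Callable[[dict, int], str],
    ask_model: Callable[[str], str],
) -> dict[int, float]:
    """Run every case at every context length; return accuracy per length."""
    results: dict[int, float] = {}
    for length in lengths:
        correct = 0
        for case in cases:
            answer = ask_model(build_prompt(case, length))
            # Crude substring scoring; swap in a scorer suited to your task.
            correct += int(case["expected"].lower() in answer.lower())
        results[length] = correct / len(cases)
    return results


def toy_build_prompt(case: dict, length: int) -> str:
    """Stand-in assembler: `length` filler words, then the fact, then the question."""
    return " ".join(["filler"] * length + [case["fact"], case["question"]])


def toy_model(prompt: str) -> str:
    """Stand-in model that only 'sees' the first 200 words of its input."""
    return " ".join(prompt.split()[:200])


cases = [{
    "fact": "The deploy key is K-17.",
    "question": "What is the deploy key?",
    "expected": "K-17",
}]
results = accuracy_by_length(cases, [100, 1000], toy_build_prompt, toy_model)
```

The toy model is deliberately broken so the output shows a length effect. With a real client, the useful part is the shape of the loop: the same cases and scorer, with length as the only thing that changes.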
. . .

Honest Uncertainty

Several things remain unresolved in the public literature. They deserve to be named.

First, no single paper decomposes the observed degradation across all of its mechanisms at once. Attention sinks, RoPE position distribution, prompt-conditioning of recall, distractor interference, and coherent-haystack confusion are each documented in isolation. Their relative contributions to any given failure are not measured jointly. We do not know, for a given model and a given task, how much of the degradation is "the model is being pulled toward early tokens" versus "the model never saw enough long-distance relative positions during training."

Second, no benchmark currently scores models on their ability to weight an explicitly-marked authoritative source over an unmarked contradicting one. The video discourse about "preserving document hierarchy" is gesturing at a real need, but the academic literature has not yet operationalized it.

Third, Chroma's finding that models perform better on shuffled haystacks than on coherent ones has not been independently replicated at the time of writing. It is the most surprising result in the report and deserves a follow-up study with multiple coherence operationalizations.

Fourth, all of the benchmarks discussed here are essentially single-shot or conversational. There is no widely adopted benchmark for tool-using agents where the context window simultaneously contains system messages, retrieved chunks, tool outputs, and intermediate reasoning. That is the working environment of nearly every production agent in 2026, and the benchmark gap is real.

If you build agents in production, the rate at which this literature is published exceeds the rate at which most teams can read it. The honest position is that we are early in understanding a phenomenon that already governs the reliability of every long-running agent in deployment.

. . .

References

  1. Hong, K., Troynikov, A., & Huber, J. (2025). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Technical Report, July 14, 2025.
  2. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, 12, 157-173.
  3. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., & Ginsburg, B. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
  4. An, C., Zhang, J., Zhong, M., Li, L., Gong, S., Luo, Y., Xu, J., & Kong, L. (2024). "Why Does the Effective Context Length of LLMs Fall Short?" arXiv:2410.18745.
  5. Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2024). "Efficient Streaming Language Models with Attention Sinks." ICLR 2024.
  6. Machlab, D., & Battle, R. (2024). "LLM In-Context Recall is Prompt Dependent." arXiv:2404.08865.
  7. Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., & Schütze, H. (2025). "NoLiMa: Long-Context Evaluation Beyond Literal Matching." ICML 2025.
  8. Levy, M., Jacoby, A., & Goldberg, Y. (2024). "Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of Large Language Models." ACL 2024.
  9. Vodrahalli, K., Ontanon, S., Tripuraneni, N., Xu, K., Jain, S., Shivanna, R., Hui, J., Dikkala, N., Kazemi, M., Fatemi, B., et al. (2024). "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries." Google DeepMind.
  10. Kuratov, Y., Bulatov, A., Anokhin, P., Rodkin, I., Sorokin, D., Sorokin, A., & Burtsev, M. (2024). "BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack." NeurIPS 2024 Datasets & Benchmarks Track.
  11. Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., & Yu, D. (2025). "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory." ICLR 2025.
  12. Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., et al. (2024). "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding." ACL 2024.
  13. Li, Z., Li, X., Liu, Y., Xie, H., Li, J., Wang, F., Li, Q., & Zhong, X. (2025). "LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs." ICML 2025.
  14. Anthropic. (2025). "Effective Context Engineering for AI Agents." Anthropic Engineering Blog.
  15. Reid, M., et al. (2024). "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context." Google DeepMind. arXiv:2403.05530.
  16. OpenAI. (2025). "Introducing GPT-4.1 in the API." OpenAI announcement.
  17. Du, Y., Tian, M., Ronanki, S., et al. (2025). "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval." EMNLP 2025 Findings.
  18. Dai, X., Shan, J., Song, Q., & Liang, Z. (2025). "HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models." arXiv:2509.05218.
  19. Chen, Y., Lv, A., Luan, J., Wang, B., & Liu, W. (2024). "HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation." arXiv:2410.21216.
  20. Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., & Lin, M. (2024). "When Attention Sink Emerges in Language Models: An Empirical View." arXiv:2410.10781.