The Invisible Boundaries of AI Conversation
Every LLM operates within a fixed-size window of attention. What happens at the edges of that window, how models forget, and why the most expensive tokens are often the ones you never meant to send.
When you paste a long document into ChatGPT and ask a question about the ending, the model sometimes gives an answer that ignores everything in the middle.
This is not a bug.
It is a consequence of how attention mechanisms work under computational constraints. Understanding context windows means understanding the fundamental architecture of modern language models.
What Is a Context Window?
A context window is the maximum number of tokens a model can process in a single forward pass. Every piece of information the model considers must fit within this window:
- System prompts and instructions
- Conversation history (all previous turns)
- The current user message
- The model's own output as it generates
This is a hard boundary, not a soft preference. Tokens outside the window do not exist to the model. They are not "deprioritized" or "dimly remembered." They are gone.
Tokens Are Not Words
The first misconception about context limits involves counting. A 128K context window does not mean 128,000 words. Tokenization compresses and expands text in non-intuitive ways.
English text averages roughly 1.3 tokens per word (about 0.75 words per token). But this ratio varies dramatically:
| Text | Words | Tokens | Ratio |
|---|---|---|---|
| "The cat sat on the mat." | 6 | 6 | 1.00 |
| "Antidisestablishmentarianism" | 1 | 7 | 7.00 |
| "def hello_world():" | 1 | 5 | 5.00 |
| "你好" | 2 | 4 | 2.00 |
| "https://example.com/api/v1/users" | 1 | 11 | 11.00 |
Code is particularly expensive. A 500-line Python file might consume 3,000-4,000 tokens. URLs, technical jargon, and non-English text all tokenize inefficiently. When you paste code into a chat, you are spending context budget at roughly 5x the rate of prose.
Every major provider offers tokenization tools. For a deeper look at how these tokenizers evolved, see Breaking Text: A Brief History of Tokenizers.
# OpenAI (tiktoken)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Your text here")
print(f"Token count: {len(tokens)}")
# Llama (SentencePiece)
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tokenizer.encode("Your text here")
print(f"Token count: {len(tokens)}")
For interactive exploration, the tiktoken documentation and various web-based token counters let you paste text and see exactly how it tokenizes.
The Quadratic Wall
Context windows have hard limits because of how attention works. Standard self-attention requires every token to attend to every other token. For a sequence of length N, this means N² attention computations.
| Sequence Length | Attention Operations | Memory (approx) |
|---|---|---|
| 2,048 | 4,194,304 | 32 MB |
| 8,192 | 67,108,864 | 512 MB |
| 32,768 | 1,073,741,824 | 8 GB |
| 131,072 | 17,179,869,184 | 128 GB |
| 1,048,576 | 1,099,511,627,776 | 8 TB |
This is not just computational cost. It is memory cost. Every attention weight must be stored. A million-token context window requires terabytes of memory just for the attention matrices, even before model parameters.
The quadratic scaling explains why context windows grew slowly for years, then jumped dramatically when new architectures emerged.
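The table's arithmetic is easy to check. The sketch below recomputes operations and memory for each row, assuming (as the table's figures imply) 8 bytes per stored attention weight:

```python
def attention_cost(seq_len, bytes_per_weight=8):
    # Full self-attention stores one weight per (query, key) pair.
    ops = seq_len ** 2                  # attention operations
    mem_bytes = ops * bytes_per_weight  # memory for the full matrix
    return ops, mem_bytes

for n in (2_048, 8_192, 32_768, 131_072, 1_048_576):
    ops, mem_bytes = attention_cost(n)
    print(f"{n:>9,} tokens -> {ops:>17,} ops, {mem_bytes / 2**20:>12,.0f} MiB")
```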
Each jump required either architectural innovation or massive hardware investment, often both.
How can providers like Google afford to offer such large context windows?
Looking at that table, a reasonable question emerges: if a 1M token context requires 8TB of memory with naive attention, how is Google offering 4M tokens in production? The quadratic math suggests 128TB of memory per query. No chip holds that.
The answer involves three things: distributed computation, algorithmic tricks, and brute force economics.
Ring Attention distributes the problem. Instead of one device computing N² attention, the sequence spreads across hundreds of Tensor Processing Units (TPUs) arranged in a ring topology. Each device handles a chunk and passes key-value pairs to its neighbors. What would require 128TB on one device becomes manageable portions across 256+ TPUs. The total computation is still O(N²), but no single chip needs to hold it all.
Flash Attention avoids materializing the full matrix. The critical insight: you do not need to store all N² attention weights simultaneously. Flash Attention computes attention in blocks, processes each block, and discards it before moving to the next. Memory scales closer to O(N) rather than O(N²), at the cost of additional recomputation. You are trading compute cycles for memory savings.
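The blockwise idea can be illustrated with a toy scalar version of attention (a sketch only: real Flash Attention operates on vector-valued queries, keys, and values, tiled for GPU memory). The online-softmax trick lets each block be folded into running statistics and then discarded:

```python
import math

def full_attention(q, keys, values):
    # Reference version: materialize all scores, softmax, weighted sum.
    scores = [q * k for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z

def blockwise_attention(q, keys, values, block=2):
    # Online softmax: never holds more than one block of scores at a time.
    m = float("-inf")  # running max of scores (numerical stability)
    z = 0.0            # running softmax denominator
    acc = 0.0          # running weighted sum of values
    for start in range(0, len(keys), block):
        scores = [q * k for k in keys[start:start + block]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale old accumulators to the new max
        z = z * scale + sum(math.exp(s - m_new) for s in scores)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(scores, values[start:start + block]))
        m = m_new
    return acc / z
```

Both functions compute the same answer; the blockwise version just never needs all the scores in memory at once.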
Custom hardware makes the communication feasible. Ring attention only works if devices can pass KV pairs faster than the attention compute bottlenecks. Google's TPU pods have purpose-built interconnects optimized for exactly this communication pattern. Standard GPU clusters cannot do this efficiently at scale.
They pay the cost. A 4M token query is expensive. The latency is higher, the compute per query is massive, and API pricing reflects this reality. Google is not escaping O(N²); they are wealthy enough to afford it at scale, and they have built infrastructure specifically to absorb these costs.
The engineering is impressive. But the fundamental math has not changed. Quadratic scaling remains quadratic. The question is whether you can distribute, optimize, and pay your way through it.
The Attention U-Curve
Even within their context windows, models do not treat all positions equally. Research by Liu et al. (2023) documented a striking phenomenon: information placed in the middle of long contexts is retrieved less accurately than information at the beginning or end.
The study tested models with 20+ documents placed at various positions in the context. When the relevant document appeared first or last, accuracy exceeded 90%. When it appeared in the middle, accuracy dropped to 50-70%, depending on the model.
This has practical implications:
System prompts work well because they appear at the beginning of every context window. The model sees them with high attention.
Recent conversation matters more because it appears at the end. The model attends to it strongly.
Long documents struggle because their middle sections fall into the attention dead zone. A 50-page PDF may have its most important content at page 25, which the model effectively ignores.
Several approaches help combat the middle-attention problem:
Chunking with overlap: Break long documents into chunks that overlap at boundaries. If the key information falls in the overlap region, it may appear near the start or end of at least one chunk.
Strategic ordering: Put the most important information first or last. If you have multiple documents, order them so critical content avoids the middle positions.
Repetition: State key facts multiple times throughout the context. This increases the chance that at least one instance falls in a high-attention region.
Explicit references: Ask the model to quote from specific sections before answering. This forces it to scan for particular content rather than relying on general attention patterns.
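As a concrete illustration of the first mitigation, a minimal token-level chunker with overlap might look like this (the `chunk_size` and `overlap` defaults are arbitrary placeholders):

```python
def chunk_with_overlap(tokens, chunk_size=1000, overlap=200):
    # Consecutive chunks share `overlap` tokens at their boundary, so content
    # near a chunk's low-attention middle lands near another chunk's edge.
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks
```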
What Happens at the Boundary
When a conversation exceeds the context window, something must be removed. Different systems handle this differently.
Truncation Strategies
Oldest-first truncation: The most common approach. Drop the oldest messages to make room for new ones. The system prompt typically stays, but early conversation history disappears.
Before truncation (10K context, 12K used):

[System: 500] [Turn 1: 2000] [Turn 2: 2500] [Turn 3: 3000] [Turn 4: 4000]

After truncation:

[System: 500] [Turn 2: 2500] [Turn 3: 3000] [Turn 4: 4000]
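A minimal sketch of oldest-first truncation (messages as `(role, token_count)` pairs; a real implementation would carry the message text as well):

```python
def truncate_oldest_first(messages, budget):
    # messages: list of (role, token_count) pairs, oldest first.
    system = [m for m in messages if m[0] == "system"]
    turns = [m for m in messages if m[0] != "system"]
    used = sum(tokens for _, tokens in system)  # system prompt is always kept
    kept = []
    for role, tokens in reversed(turns):        # walk newest to oldest
        if used + tokens > budget:
            break                               # this turn and everything older is dropped
        kept.append((role, tokens))
        used += tokens
    return system + kept[::-1]                  # restore chronological order
```

Run on the example above (a 500-token system prompt plus turns of 2000, 2500, 3000, and 4000 tokens against a 10K budget), Turn 1 is the one that disappears.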
Sliding window: Keep only the N most recent turns, regardless of token count. Simpler but wastes context if recent turns are short.
Selective truncation: Keep certain messages (system prompt, user preferences, key decisions) while dropping others. Requires metadata about message importance.
Summarization: Before dropping old messages, summarize them into a compressed form that captures key information in fewer tokens.
Example compressed form: [Summary: "…requested feature Y, agreed to Z"]
The Summarization Trade-off
Summarization preserves more history but introduces risk. The summary is generated by the model itself. If the model misunderstands something early in the conversation, that misunderstanding gets baked into the summary and persists indefinitely.
There is also the question of what to include. Summaries necessarily lose detail. A fact that seemed unimportant when the summary was generated may become critical later.
Most production systems use a hybrid approach: summarize very old messages, keep recent messages verbatim, and preserve explicitly marked "important" messages regardless of age.
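One way the hybrid approach might be wired up (a sketch: `summarize` stands in for a model call, and the `important` flag is assumed metadata that the application sets):

```python
def hybrid_truncate(messages, keep_recent, summarize):
    # messages: dicts with "role", "text", and an optional "important" flag.
    head = [m for m in messages if m["role"] == "system"]
    body = [m for m in messages if m["role"] != "system"]
    old, recent = body[:-keep_recent], body[-keep_recent:]
    kept_old = [m for m in old if m.get("important")]   # pinned messages survive verbatim
    dropped = [m for m in old if not m.get("important")]
    summary = [summarize(dropped)] if dropped else []   # compress everything else
    return head + summary + kept_old + recent
```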
Effective Context vs. Nominal Context
A model's advertised context window is its theoretical maximum. Effective context is how much of that window actually contributes to output quality.
Several factors reduce effective context:
Attention dilution: As context grows, attention spreads across more tokens. Each individual token gets less attention weight. Information that would be captured in a 4K context might be missed in a 128K context with the same content buried among more tokens.
Retrieval degradation: The "lost in the middle" phenomenon means that nominal context growth does not translate to proportional retrieval improvement.
Prompt engineering overhead: System prompts, few-shot examples, and output formatting instructions all consume context budget before user content even enters.
A practical rule: expect effective retrieval to scale logarithmically rather than linearly with context size. Doubling the context window does not double your ability to retrieve arbitrary facts from it.
Context Window Architectures
Different approaches to extending context windows trade off speed, accuracy, and implementation complexity.
Standard Self-Attention
The original Transformer architecture. Every token attends to every other token. O(N²) computation and memory.
Strengths: Conceptually simple, well-understood, no information loss.
Weaknesses: Does not scale beyond ~32K tokens without massive hardware.
Sparse Attention (Longformer, BigBird)
Replace full attention with patterns: local windows (attend to neighbors), global tokens (special tokens attend to everything), and random connections.
Strengths: Reduces complexity to O(N), enables longer contexts on standard hardware.
Weaknesses: Pattern design requires tuning, some information pathways are severed.
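The local-window component of these patterns is easy to picture. A sketch of the mask (ignoring the global and random connections) and the savings it buys:

```python
def local_attention_mask(n, window):
    # mask[i][j] is True when token i may attend to token j:
    # only neighbors within `window` positions are connected.
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def allowed_pairs(mask):
    return sum(sum(row) for row in mask)

mask = local_attention_mask(1024, window=64)
print(allowed_pairs(mask), "of", 1024 * 1024, "pairs")  # roughly O(N * window), not O(N^2)
```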
Rotary Position Embeddings (RoPE)
Encode position through rotation matrices rather than learned embeddings. Models can extrapolate to longer sequences than seen during training.
Strengths: Better length generalization, efficient computation.
Weaknesses: Still requires attention computation, extrapolation has limits.
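The relative-position property can be demonstrated on a single 2-D feature pair (a sketch: real RoPE applies many such rotations at different frequencies across the head dimension):

```python
import math

def rope_rotate(pair, pos, freq=1.0):
    # Rotate a 2-D feature pair by an angle proportional to its position.
    x, y = pair
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]
```

The key property: the dot product of a rotated query and key depends only on the *distance* between their positions, so shifting both positions by the same offset leaves attention scores unchanged.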
Ring Attention
Distribute the attention computation across multiple devices. Each device handles a chunk of the sequence and passes key-value pairs around a ring topology.
Strengths: Scales to millions of tokens with enough devices.
Weaknesses: Requires specialized infrastructure, latency increases with ring size.
State Space Models (Mamba, etc.)
Replace attention entirely with selective state space layers that process sequences in O(N) time without the quadratic attention matrix.
Strengths: Linear scaling, no context window in the traditional sense.
Weaknesses: Newer architecture, less battle-tested, quality trade-offs still being understood.
Token Economics
Context windows are not just technical constraints. They are economic ones. API providers charge per token, both input and output.
Cost structure (example pricing):

- Input tokens: $0.01 per 1K tokens
- Output tokens: $0.03 per 1K tokens

A conversation that uses the full 128K context: 128K input tokens ($1.28) + 2K output tokens ($0.06) = $1.34 per exchange.

A conversation with an efficient 8K context: 8K input tokens ($0.08) + 2K output tokens ($0.06) = $0.14 per exchange.
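This arithmetic is easy to script (the default rates are the example rates above, not any provider's actual prices):

```python
def exchange_cost(input_tokens, output_tokens,
                  input_rate=0.01, output_rate=0.03):
    # Rates are dollars per 1K tokens (illustrative example pricing).
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

print(f"${exchange_cost(128_000, 2_000):.2f}")  # full 128K context -> $1.34
print(f"${exchange_cost(8_000, 2_000):.2f}")    # efficient 8K context -> $0.14
```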
The 128K conversation costs nearly 10x more for the same output. And this cost compounds: every subsequent message in that conversation pays for the full context again.
This creates economic incentive to minimize context usage:
- Compress prompts: Remove unnecessary whitespace, use abbreviations in system prompts, eliminate redundant instructions.
- Summarize history: As conversations grow, compress older turns rather than carrying full verbatim history.
- Use retrieval: Instead of stuffing documents into context, retrieve only relevant chunks.
- Cache intelligently: Some providers offer prompt caching discounts for repeated context prefixes.
The Hidden Cost of Verbosity
System prompts and few-shot examples get sent with every request. A verbose system prompt of 2,000 tokens instead of an efficient 500-token version costs an extra $0.015 per request. At 10,000 requests per day, that is $150 daily, $4,500 monthly, $54,000 annually, all for saying the same thing less efficiently.
Practical Guidelines
Based on how context windows actually work, here are guidelines for building applications:
For Chat Applications
- Keep system prompts concise. Every token there is paid on every message.
- Implement sliding windows with summarization. Do not wait for truncation to happen automatically.
- Put critical instructions first and last. Avoid burying important guidance in the middle of long prompts.
- Track token usage. Monitor context consumption and warn users as they approach limits.
For Document Processing
- Chunk strategically. Overlap at boundaries, keep chunk sizes well under the context limit to leave room for prompts and output.
- Order by importance. If processing multiple documents, put the most critical ones first.
- Use RAG for retrieval. Do not rely on long-context models to find needles in haystacks. Vector search is more reliable for factual retrieval.
- Test the middle. Specifically test whether your application retrieves information placed in the middle of long contexts.
For Code Generation
- Be selective about context. Do not paste entire codebases. Include only relevant files.
- Watch tokenization. Code tokenizes expensively. A "small" code change might consume thousands of tokens.
- Use file references over inline code. Some systems can reference files by path rather than including full content.
The Evolution Continues
Context windows will keep growing. Models announced in 2024-2025 already handle millions of tokens. But the fundamental trade-offs remain:
- Longer context requires more computation
- Attention dilutes over distance
- Economic costs scale with usage
- Effective retrieval lags nominal capacity
The models that handle these trade-offs best will not necessarily have the largest context windows. They will have the most efficient use of whatever context they have.
Understanding these dynamics, knowing what tokens cost, where attention focuses, and how truncation works, is essential for building applications that work reliably within the invisible boundaries that shape every AI conversation.
References
- Vaswani, A., et al. "Attention Is All You Need." NeurIPS, 2017.
- Liu, N., et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2023.
- Beltagy, I., et al. "Longformer: The Long-Document Transformer." arXiv, 2020.
- Su, J., et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv, 2021.
- Gu, A., & Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv, 2023.
- Liu, H., et al. "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv, 2023.
- Dao, T., et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS, 2022.
Related Articles
- The Elegant Hack Powering Modern AI - How LLMs transform text into tokens
- Breaking Text: A Brief History of Tokenizers - BPE, SentencePiece, and tiktoken
- Why Non-English Speakers Pay More for AI - The hidden cost of tokenization
- Words Learning the Company They Keep - The mechanism behind contextual embeddings
- The Hidden Geography of Language - Word embeddings and semantic space