The Invisible Boundaries of AI Conversation

Every LLM operates within a fixed-size window of attention. What happens at the edges of that window, how models forget, and why the most expensive tokens are often the ones you never meant to send.

When you paste a long document into ChatGPT and ask a question about the ending, the model sometimes gives an answer that ignores everything in the middle.

Cartoon illustration of a person with a magnifying glass searching through a giant scroll of text
Somewhere in there, the user asked a question.

This is not a bug.

It is a consequence of how attention mechanisms work under computational constraints. Understanding context windows means understanding the fundamental architecture of modern language models.

What Is a Context Window?

A context window is the maximum number of tokens a model can process in a single forward pass. Every piece of information the model considers must fit within this window:

Context Window: 8,192 tokens
[SYSTEM PROMPT] ~500 tokens
[USER MESSAGE 1] ~200 tokens
[ASSISTANT RESPONSE 1] ~400 tokens
[USER MESSAGE 2] ~150 tokens
[ASSISTANT RESPONSE 2] ~600 tokens
[USER MESSAGE 3] ~300 tokens
[ASSISTANT RESPONSE 3] ??? tokens
Used: 2,150 tokens
Remaining for generation: 6,042 tokens

This is a hard boundary, not a soft preference. Tokens outside the window do not exist to the model. They are not "deprioritized" or "dimly remembered." They are gone.
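
The budget accounting above can be sketched in a few lines. The message token counts here are the illustrative figures from the diagram, not measured values:

```python
# Context-budget accounting for the example breakdown above.
CONTEXT_WINDOW = 8_192

messages = [
    ("system", 500),
    ("user", 200), ("assistant", 400),
    ("user", 150), ("assistant", 600),
    ("user", 300),
]

used = sum(tokens for _, tokens in messages)
remaining = CONTEXT_WINDOW - used
print(f"Used: {used} tokens, remaining for generation: {remaining}")
# Used: 2150 tokens, remaining for generation: 6042
```

Anything that does not fit in `remaining` simply cannot be generated; production systems run exactly this arithmetic before every request.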

Tokens Are Not Words

The first misconception about context limits involves counting. A 128K context window does not mean 128,000 words. Tokenization compresses and expands text in non-intuitive ways.

English text averages roughly 1.3 tokens per word (about 0.75 words per token). But this ratio varies dramatically:

Text                                 Words   Tokens  Ratio
................................................................
"The cat sat on the mat."            6       6       1.00
"Antidisestablishmentarianism"       1       7       7.00
"def hello_world():"                 1       5       5.00
"你好"                                2       4       2.00
"https://example.com/api/v1/users"   1       11      11.00

Code is particularly expensive. A 500-line Python file might consume 3,000-4,000 tokens. URLs, technical jargon, and non-English text all tokenize inefficiently. When you paste code into a chat, you are spending context budget at roughly 5x the rate of prose.

Every major provider offers tokenization tools. For a deeper look at how these tokenizers evolved, see Breaking Text: A Brief History of Tokenizers.

# OpenAI (tiktoken)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Your text here")
print(f"Token count: {len(tokens)}")

# Llama (SentencePiece)
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tokenizer.encode("Your text here")
print(f"Token count: {len(tokens)}")

For interactive exploration, the tiktoken documentation and various web-based token counters let you paste text and see exactly how it tokenizes.

The Quadratic Wall

Context windows have hard limits because of how attention works. Standard self-attention requires every token to attend to every other token. For a sequence of length N, this means N² attention computations.

Sequence Length   Attention Operations   Memory (approx)
2,048             4,194,304              32 MB
8,192             67,108,864             512 MB
32,768            1,073,741,824          8 GB
131,072           17,179,869,184         128 GB
1,048,576         1,099,511,627,776      8 TB

This is not just computational cost. It is memory cost. Every attention weight must be stored. A million-token context window requires terabytes of memory just for the attention matrices, even before model parameters.
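
The table's numbers can be reproduced directly. This sketch assumes 8 bytes per stored attention weight, which is what makes the memory column above work out:

```python
def attention_cost(seq_len, bytes_per_weight=8):
    """Naive self-attention: seq_len² weights, each stored in memory."""
    ops = seq_len * seq_len
    mem_bytes = ops * bytes_per_weight
    return ops, mem_bytes

for n in (2_048, 8_192, 32_768, 131_072, 1_048_576):
    ops, mem = attention_cost(n)
    print(f"{n:>9,} tokens -> {ops:>17,} ops, {mem / 2**20:,.0f} MiB")
```

Each 4x increase in sequence length multiplies both columns by 16; that is the quadratic wall in numeric form.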

The quadratic scaling explains why context windows grew slowly for years, then jumped dramatically when new architectures emerged.

See for yourself:

Interactive: Context window evolution demo

Each jump required either architectural innovation or massive hardware investment, often both.

How can providers like Google afford to offer such large context windows?

Looking at that table, a reasonable question emerges: if a 1M-token context requires 8 TB of memory with naive attention, how can Google offer multi-million-token contexts in production? The quadratic math suggests tens of terabytes of memory per query. No chip holds that.

The answer involves three things: distributed computation, algorithmic tricks, and brute force economics.

Ring Attention distributes the problem. Instead of one device computing N² attention, the sequence spreads across hundreds of Tensor Processing Units (TPUs) arranged in a ring topology. Each device handles a chunk and passes key-value pairs to its neighbors. What would be terabytes of attention state on a single device becomes a manageable slice on each of 256+ TPUs. The total computation is still O(N²), but no single chip needs to hold it all.

Flash Attention avoids materializing the full matrix. The critical insight: you do not need to store all N² attention weights simultaneously. Flash Attention computes attention in blocks, processes each block, and discards it before moving to the next. Memory scales closer to O(N) rather than O(N²), at the cost of some recomputation. You are trading compute cycles for memory savings.

Custom hardware makes the communication feasible. Ring attention only works if devices can pass KV pairs faster than the attention compute bottlenecks. Google's TPU pods have purpose-built interconnects optimized for exactly this communication pattern. Standard GPU clusters cannot do this efficiently at scale.

They pay the cost. A multi-million-token query is expensive. The latency is higher, the compute per query is massive, and API pricing reflects this reality. Google is not escaping O(N²); they are wealthy enough to afford it at scale, and they have built infrastructure specifically to absorb these costs.

The engineering is impressive. But the fundamental math has not changed. Quadratic scaling remains quadratic. The question is whether you can distribute, optimize, and pay your way through it.
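
The blockwise idea behind Flash Attention can be illustrated with an online-softmax sketch. This is a toy single-query version in NumPy, not the real fused kernel: it holds at most one block of attention logits in memory, rescales its running results as larger logits appear, and still matches full softmax attention exactly:

```python
import numpy as np

def blockwise_attention(q, K, V, block=64):
    """Attention for one query q against keys K and values V, processed
    block by block. Peak extra memory is one block of logits, never the
    full row of attention weights."""
    m = -np.inf                       # running max of logits (numerical stability)
    denom = 0.0                       # running softmax denominator
    out = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        logits = K[start:start + block] @ q
        new_m = max(m, logits.max())
        scale = np.exp(m - new_m)     # rescale earlier partial results
        w = np.exp(logits - new_m)
        denom = denom * scale + w.sum()
        out = out * scale + w @ V[start:start + block]
        m = new_m
    return out / denom

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(256, 8)), rng.normal(size=(256, 8))
full = np.exp(K @ q - (K @ q).max())
full /= full.sum()
print(np.allclose(blockwise_attention(q, K, V), full @ V))  # True
```

The output is exact, not approximate; only the memory access pattern changes. That is why Flash Attention is described as trading recomputation for memory rather than trading accuracy.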

The Attention U-Curve

Even within their context windows, models do not treat all positions equally. Research by Liu et al. (2023) documented a striking phenomenon: information placed in the middle of long contexts is retrieved less accurately than information at the beginning or end.

The study tested models with 20+ documents placed at various positions in the context. When the relevant document appeared first or last, accuracy exceeded 90%. When it appeared in the middle, accuracy dropped to 50-70%, depending on the model.

This has practical implications:

System prompts work well because they appear at the beginning of every context window. The model sees them with high attention.

Recent conversation matters more because it appears at the end. The model attends to it strongly.

Long documents struggle because their middle sections fall into the attention dead zone. A 50-page PDF may have its most important content at page 25, which the model effectively ignores.

Cartoon illustration of a person viewing a scroll with papers scattered around, representing truncated context
What falls outside the frame, stays outside the frame.

Several approaches help combat the middle-attention problem:

Chunking with overlap: Break long documents into chunks that overlap at boundaries. If the key information falls in the overlap region, it may appear near the start or end of at least one chunk.

Strategic ordering: Put the most important information first or last. If you have multiple documents, order them so critical content avoids the middle positions.

Repetition: State key facts multiple times throughout the context. This increases the chance that at least one instance falls in a high-attention region.

Explicit references: Ask the model to quote from specific sections before answering. This forces it to scan for particular content rather than relying on general attention patterns.
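
The chunking-with-overlap idea is simple to sketch at the token level. Chunk size and overlap here are arbitrary small values for illustration:

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split a token sequence into chunks of `size` tokens, each sharing
    `overlap` tokens with its predecessor. Assumes overlap < size."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_with_overlap(list(range(10)), size=4, overlap=2)
print(chunks)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Any token within `overlap` of a chunk boundary appears near the edge of at least one chunk, which is exactly the high-attention region the U-curve favors.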

What Happens at the Boundary

When a conversation exceeds the context window, something must be removed. Different systems handle this differently.

Truncation Strategies

Oldest-first truncation: The most common approach. Drop the oldest messages to make room for new ones. The system prompt typically stays, but early conversation history disappears.

Before truncation (10K context, 12K used):
[System: 500] [Turn 1: 2000] [Turn 2: 2500] [Turn 3: 3000] [Turn 4: 4000]

After truncation:
[System: 500] [Turn 2: 2500] [Turn 3: 3000] [Turn 4: 4000]
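
The oldest-first policy above can be sketched directly; the message labels and token counts mirror the example diagram:

```python
def truncate_oldest_first(messages, limit):
    """Drop the oldest non-system messages until the total fits."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    total = sum(tokens for _, tokens in system + rest)
    while rest and total > limit:
        _, tokens = rest.pop(0)       # oldest turn goes first
        total -= tokens
    return system + rest

history = [("system", 500), ("turn1", 2000), ("turn2", 2500),
           ("turn3", 3000), ("turn4", 4000)]
print(truncate_oldest_first(history, limit=10_000))
# [('system', 500), ('turn2', 2500), ('turn3', 3000), ('turn4', 4000)]
```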

Sliding window: Keep only the N most recent turns, regardless of token count. Simpler but wastes context if recent turns are short.

Selective truncation: Keep certain messages (system prompt, user preferences, key decisions) while dropping others. Requires metadata about message importance.

Summarization: Before dropping old messages, summarize them into a compressed form that captures key information in fewer tokens.

Original conversation: 8,000 tokens
        ▼
Summarize early turns:
"User discussed project X,
requested feature Y, agreed to Z"
        ▼
Summary: 200 tokens + Recent: 4,000 tokens = 4,200 tokens total

The Summarization Trade-off

Summarization preserves more history but introduces risk. The summary is generated by the model itself. If the model misunderstands something early in the conversation, that misunderstanding gets baked into the summary and persists indefinitely.

There is also the question of what to include. Summaries necessarily lose detail. A fact that seemed unimportant when the summary was generated may become critical later.

Most production systems use a hybrid approach: summarize very old messages, keep recent messages verbatim, and preserve explicitly marked "important" messages regardless of age.
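
One way to sketch that hybrid policy. The `summarize` callable stands in for a model call and is purely illustrative, as are the message tuples:

```python
def compact_history(messages, budget, summarize):
    """messages: oldest-to-newest list of (text, token_count, important).
    Keep important and recent messages verbatim; summarize the rest."""
    kept, total = set(), 0
    for i in range(len(messages) - 1, -1, -1):    # walk newest-first
        text, tokens, important = messages[i]
        if important or total + tokens <= budget:
            kept.add(i)
            total += tokens
    dropped = [messages[i][0] for i in range(len(messages)) if i not in kept]
    history = [messages[i][0] for i in sorted(kept)]
    if dropped:
        history.insert(0, summarize(dropped))     # compressed stand-in
    return history

msgs = [("turn1", 2000, False), ("turn2", 2500, False),
        ("decision", 3000, True), ("turn4", 4000, False)]
print(compact_history(msgs, budget=8000, summarize=lambda texts: "<summary>"))
# ['<summary>', 'decision', 'turn4']
```

Note that the "important" flag overrides the budget, which is the point: a key decision survives even when older small talk does not.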

Try it yourself:

Interactive: Context window truncation demo

Effective Context vs. Nominal Context

A model's advertised context window is its theoretical maximum. Effective context is how much of that window actually contributes to output quality.

Several factors reduce effective context:

Attention dilution: As context grows, attention spreads across more tokens. Each individual token gets less attention weight. Information that would be captured in a 4K context might be missed in a 128K context with the same content buried among more tokens.

Retrieval degradation: The "lost in the middle" phenomenon means that nominal context growth does not translate to proportional retrieval improvement.

Prompt engineering overhead: System prompts, few-shot examples, and output formatting instructions all consume context budget before user content even enters.

A practical rule: expect effective retrieval to scale logarithmically rather than linearly with context size. Doubling the context window does not double your ability to retrieve arbitrary facts from it.

Context Window Architectures

Different approaches to extending context windows trade off speed, accuracy, and implementation complexity.

Standard Self-Attention

The original Transformer architecture. Every token attends to every other token. O(N²) computation and memory.

Strengths: Conceptually simple, well-understood, no information loss.

Weaknesses: Does not scale beyond ~32K tokens without massive hardware.

Sparse Attention (Longformer, BigBird)

Replace full attention with patterns: local windows (attend to neighbors), global tokens (special tokens attend to everything), and random connections.

Strengths: Reduces complexity to O(N), enables longer contexts on standard hardware.

Weaknesses: Pattern design requires tuning, some information pathways are severed.
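
A toy version of such a pattern, combining a local window with global tokens. The window size and global positions are illustrative, not values from any published model:

```python
import numpy as np

def sparse_attention_mask(n, window=1, global_tokens=(0,)):
    """True where attention is allowed: each token sees its local
    neighborhood, and global tokens see (and are seen by) everyone."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):i + window + 1] = True
    for g in global_tokens:
        mask[g, :] = mask[:, g] = True
    return mask

m = sparse_attention_mask(6)
print(m.sum(), "of", m.size, "connections kept")
```

Full attention would keep all n² connections; here the kept count grows roughly linearly in n, which is the source of the O(N) complexity claim.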

Rotary Position Embeddings (RoPE)

Encode position through rotation matrices rather than learned embeddings. Models can extrapolate to longer sequences than seen during training.

Strengths: Better length generalization, efficient computation.

Weaknesses: Still requires attention computation, extrapolation has limits.
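
The core property is easy to demonstrate: rotating query and key features by position-dependent angles makes their dot product depend only on the relative offset between positions. A minimal sketch of the rotation:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles that grow with
    position, as in rotary position embeddings. x must have even length."""
    d = x.shape[0]
    angles = pos * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
# Attention score depends only on the offset: positions 5-7 vs 12-14
print(np.isclose(rope(q, 5) @ rope(k, 7), rope(q, 12) @ rope(k, 14)))  # True
```

Because only relative offsets matter, a model trained on short sequences has at least a chance of behaving sensibly at longer ones, though in practice extrapolation still degrades.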

Ring Attention

Distribute the attention computation across multiple devices. Each device handles a chunk of the sequence and passes key-value pairs around a ring topology.

Strengths: Scales to millions of tokens with enough devices.

Weaknesses: Requires specialized infrastructure, latency increases with ring size.

State Space Models (Mamba, etc.)

Replace attention entirely with selective state space layers that process sequences in O(N) time without the quadratic attention matrix.

Strengths: Linear scaling, no context window in the traditional sense.

Weaknesses: Newer architecture, less battle-tested, quality trade-offs still being understood.
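
The contrast with attention shows up in a toy linear state-space recurrence: each step updates a fixed-size state, so a length-N sequence costs O(N) time and O(1) memory. This omits Mamba's input-dependent selectivity entirely; it is only the structural skeleton:

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """h_t = A h_{t-1} + B x_t ;  y_t = C h_t
    The state h has fixed size regardless of sequence length."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B * x
        ys.append(C @ h)
    return np.array(ys)

A = np.array([[0.5]])                  # decaying memory of past inputs
B, C = np.array([1.0]), np.array([1.0])
print(ssm_scan(A, B, C, [1.0, 1.0, 1.0]))  # outputs 1.0, 1.5, 1.75
```

There is no N x N matrix anywhere, hence "no context window in the traditional sense": the limit becomes how much the fixed-size state can remember, not how many tokens fit.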

Token Economics

Context windows are not just technical constraints. They are economic ones. API providers charge per token, both input and output.

Cost structure (example pricing):

Input tokens:  $0.01 per 1K tokens
Output tokens: $0.03 per 1K tokens

A conversation that uses full 128K context:
   128K input tokens  = $1.28
   + 2K output tokens = $0.06
   = $1.34 per exchange

A conversation with efficient 8K context:
   8K input tokens    = $0.08
   + 2K output tokens = $0.06
   = $0.14 per exchange

The 128K conversation costs nearly 10x more for the same output. And this cost compounds: every subsequent message in that conversation pays for the full context again.

This creates a strong economic incentive to minimize context usage.

The Hidden Cost of Verbosity

System prompts and few-shot examples get sent with every request. A verbose system prompt of 2,000 tokens instead of an efficient 500-token version costs an extra $0.015 per request. At 10,000 requests per day, that is $150 daily, $4,500 monthly, $54,000 annually, all for saying the same thing less efficiently.
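
The arithmetic above, as a sketch. The prices mirror the example figures earlier in this section and are not any provider's actual rates:

```python
def prompt_overhead(extra_tokens, price_per_1k, requests_per_day):
    """Cost of carrying extra system-prompt tokens on every request."""
    per_request = extra_tokens / 1000 * price_per_1k
    daily = per_request * requests_per_day
    monthly = daily * 30
    return per_request, daily, monthly, monthly * 12

per_req, daily, monthly, yearly = prompt_overhead(1_500, 0.01, 10_000)
print(f"${per_req:.3f}/request, ${daily:,.0f}/day, "
      f"${monthly:,.0f}/month, ${yearly:,.0f}/year")
# $0.015/request, $150/day, $4,500/month, $54,000/year
```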

Practical Guidelines

Based on how context windows actually work, here are guidelines for building applications:

For Chat Applications

  1. Keep system prompts concise. Every token there is paid on every message.
  2. Implement sliding windows with summarization. Do not wait for truncation to happen automatically.
  3. Put critical instructions first and last. Avoid burying important guidance in the middle of long prompts.
  4. Track token usage. Monitor context consumption and warn users as they approach limits.

For Document Processing

  1. Chunk strategically. Overlap at boundaries, keep chunk sizes well under the context limit to leave room for prompts and output.
  2. Order by importance. If processing multiple documents, put the most critical ones first.
  3. Use RAG for retrieval. Do not rely on long-context models to find needles in haystacks. Vector search is more reliable for factual retrieval.
  4. Test the middle. Specifically test whether your application retrieves information placed in the middle of long contexts.

For Code Generation

  1. Be selective about context. Do not paste entire codebases. Include only relevant files.
  2. Watch tokenization. Code tokenizes expensively. A "small" code change might consume thousands of tokens.
  3. Use file references over inline code. Some systems can reference files by path rather than including full content.

The Evolution Continues

Context windows will keep growing. Models announced in 2024-2025 already handle millions of tokens. But the fundamental trade-offs remain: quadratic compute, middle-of-context retrieval, and per-token cost.

The models that handle these trade-offs best will not necessarily have the largest context windows. They will have the most efficient use of whatever context they have.

Understanding these dynamics, knowing what tokens cost, where attention focuses, and how truncation works, is essential for building applications that work reliably within the invisible boundaries that shape every AI conversation.


References

  1. Vaswani, A., et al. "Attention Is All You Need." NeurIPS, 2017.
  2. Liu, N., et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2023.
  3. Beltagy, I., et al. "Longformer: The Long-Document Transformer." arXiv, 2020.
  4. Su, J., et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv, 2021.
  5. Gu, A., & Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv, 2023.
  6. Liu, H., et al. "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv, 2023.
  7. Dao, T., et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS, 2022.
