What Breaks

LLM systems fail in ways that traditional software doesn't. The failures are probabilistic, context-dependent, and often invisible until a user complains.

In Brief

LLM system failures differ from traditional software failures in a fundamental way: they are probabilistic, often invisible until statistical patterns become obvious, and almost always live at component boundaries rather than inside components. A prompt that produces a correct response 95 times out of 100 can fail confidently on the next call. A retrieval system can start returning slightly less relevant documents after a subtle data change. A model update can shift how the same system prompt is interpreted. Six failure types recur across production systems: prompt drift, retrieval poisoning, tool cascade failures, context overflow, evaluation blind spots, and silent degradation. Each has its own prevention strategy: regression testing, chunk-quality validation, idempotency keys, explicit token budgeting, multidimensional evaluation, and distribution-drift detection, respectively.

The common thread is that these failures occur at component boundaries, not inside components. Testing each piece in isolation is necessary but not sufficient, because the expensive failures (the ones that reach production and erode trust) live in the spaces between components. A post-mortem template that separates symptom from root cause, explains the detection gap (why did existing tests miss this?), and produces a regression test for the next person is what builds the discipline that catches boundary failures before they become customer-facing crises. The template itself does not fix anything; writing it forces the precision that eventually does.

Why LLM Failures Are Different

Traditional software fails deterministically. A null pointer dereference crashes the process every time. A network timeout returns a well-defined error code. You reproduce the bug, fix it, write a regression test, and move on. The failure mode is a known quantity.⁸⁹

LLM failures are stochastic.⁴¹⁰ The same input might produce a correct answer 95 times out of 100 and a subtly wrong answer the other five. The failure might not be a crash or an exception. It might be a response that is grammatically fluent, confidently stated, and factually incorrect. "Wrong" is often a judgment call rather than a binary.¹¹

This changes how you debug. You often cannot reproduce the failure on demand. A single deterministic assertion will not reliably catch it. You cannot bisect your way to a root cause with git bisect, because the code may not have changed at all. Instead, you need statistical methods, distribution-level monitoring, and evaluation frameworks that tolerate ambiguity.
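One statistical method that fits this situation is to run the same check many times and report a confidence interval on the pass rate instead of a single pass/fail verdict. The sketch below uses the Wilson score interval; the function name and the 95-of-100 figures are illustrative, not taken from any specific system.

```python
import math

def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = passes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# 95 passes in 100 runs of the same prompt
low, high = wilson_interval(passes=95, trials=100)
print(f"plausible true pass rate: {low:.3f} to {high:.3f}")
```

With 100 runs the interval spans roughly 0.89 to 0.98, which is a reminder that "95% correct" measured on a small sample is compatible with a noticeably worse true rate.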

The six failure types that follow are drawn from real production systems. Each follows the same structure: what the user saw, what actually happened, and what the team did about it.

. . .

Failure Type 1: Prompt Drift

What the user saw

A customer support bot worked well at launch. Three months later, users began complaining that it gave irrelevant answers, ignored product-specific instructions, and occasionally hallucinated features that did not exist. Support ticket volume increased 40% over six weeks.

What actually happened

The underlying model was updated from Claude 3 to Claude 3.5.¹² The new model interprets system prompts with different emphasis.¹² Phrases that previously anchored behavior ("You are a support agent for Acme Corp. Never discuss competitors.") were weighted differently by the updated model. The word "never" triggered more cautious behavior in the new version, causing the bot to refuse legitimate questions about product comparisons.¹³

Simultaneously, users had learned what the bot could do over three months of use. Their queries became more complex and more creative, pushing the system outside the original prompt's design envelope. The prompt was written for simple FAQ-style questions. Users were now asking multi-step troubleshooting flows.

The fix

The team introduced prompt regression testing: a suite of 200 representative queries with expected response characteristics, run automatically before any model update. They version-pinned the model in production and treated model upgrades as deployment events requiring the same rigor as code releases. They added weekly monitoring of response quality scores, computed by a separate evaluation model, to catch gradual drift before users noticed.⁷
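A prompt regression suite of this kind can be sketched as a list of queries, each paired with phrases the response must or must not contain. Everything here is an assumption for illustration: the `RegressionCase` structure, the substring checks, and the `stub_model` function that stands in for a real model call. Real suites usually check richer response characteristics than substrings.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RegressionCase:
    query: str
    must_include: List[str] = field(default_factory=list)
    must_not_include: List[str] = field(default_factory=list)

def run_regression(cases, generate: Callable[[str], str], min_pass_rate=0.95):
    """Run every case through `generate` and report the pass rate and failures."""
    if not cases:
        raise ValueError("regression suite is empty")
    failures = []
    for case in cases:
        response = generate(case.query).lower()
        missing = [s for s in case.must_include if s.lower() not in response]
        forbidden = [s for s in case.must_not_include if s.lower() in response]
        if missing or forbidden:
            failures.append({"query": case.query, "missing": missing, "forbidden": forbidden})
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate, "ok": pass_rate >= min_pass_rate, "failures": failures}

def stub_model(query):
    # Hypothetical stand-in for a model that over-refuses comparison questions
    if "compare" in query.lower():
        return "I'm sorry, I can't discuss that."
    return "You can reset your password from the Settings page."

cases = [
    RegressionCase("How do I reset my password?", must_include=["settings"]),
    RegressionCase("Compare the Basic and Pro plans.", must_not_include=["can't discuss"]),
]
report = run_regression(cases, stub_model)
print(report["ok"], report["failures"])
```

Running the same suite against the pinned model and the candidate model before an upgrade turns "the new model feels different" into a list of specific failing queries.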

. . .

Failure Type 2: Retrieval Poisoning

What the user saw

A RAG-based legal research tool started citing cases that did not exist.³¹⁴ The citations looked plausible: correct formatting, realistic party names, believable docket numbers. But when attorneys checked, the cases were fictional. Three briefs were filed with fabricated citations before the problem was caught.¹⁶

What actually happened

The document corpus was updated with a batch of scanned PDFs from a state court archive. The OCR quality was poor, and the chunking strategy split case citations across chunk boundaries. A citation like Smith v. Johnson, 482 F.3d 1127 (9th Cir. 2007) would be split into two chunks: one ending with "Smith v. Johnson, 482" and the next beginning with "F.3d 1127 (9th Cir. 2007)."

When the retrieval system returned these fragments, the model attempted to reconstruct complete citations from partial information. It combined elements from different cases, generating citations that were structurally valid but referentially empty. The model was not "hallucinating" in the usual sense. It was doing its best with corrupted input.¹⁵

The fix

The team implemented chunk quality validation: every chunk is scored for completeness before indexing. Legal citations are detected with regex patterns and kept intact within single chunks. A citation verification pipeline cross-references every case cited in a response against a legal database API. Documents that fail OCR quality thresholds are flagged for manual review rather than indexed automatically.
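The citation-preserving part of that fix can be sketched as a chunker that refuses to cut inside a regex match. The pattern below is a deliberately narrow assumption that covers only a few federal reporter formats such as the example above; a production pattern would need to be far broader, and the character-based chunk size is also illustrative.

```python
import re

# Assumed pattern: optional "Party v. Party, " then volume, reporter, page, optional (court year)
CITATION = re.compile(
    r"(?:[A-Z][A-Za-z.'-]+ v\. [A-Z][A-Za-z.'-]+, )?"
    r"\d{1,4} (?:F\.(?:2d|3d|4th)?|U\.S\.|S\. ?Ct\.) \d{1,5}"
    r"(?: \([^)]{0,40}\d{4}\))?"
)

def chunk_text(text, max_chars=200):
    """Split text into chunks of about max_chars without cutting through a citation."""
    spans = [m.span() for m in CITATION.finditer(text)]
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        for s, e in spans:
            if s < end < e:
                # The proposed cut lands inside a citation: cut before it if
                # possible, otherwise extend the chunk to the citation's end.
                end = s if s > start else e
                break
        chunks.append(text[start:end])
        start = end
    return chunks

text = "The panel relied on Smith v. Johnson, 482 F.3d 1127 (9th Cir. 2007) when it reversed."
for chunk in chunk_text(text, max_chars=40):
    print(repr(chunk))
```

A chunk may exceed `max_chars` when a single citation is longer than the limit; that trade-off favors intact citations over uniform chunk sizes.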

. . .

Failure Type 3: Tool Cascade Failures

What the user saw

An agentic travel assistant started making duplicate hotel reservations. Users would request one booking and find two charges on their credit card. In one case, a user was charged for four identical rooms at the same hotel for the same dates.

What actually happened

When the booking API returned a timeout after 30 seconds, the agent framework treated it as a failed tool call and asked the model to retry. But the timeout was on the response, not the request. The hotel's system had received and processed the booking; it simply took too long to send the confirmation back. Each "retry" created a new reservation.

The model cannot distinguish between "the action failed" and "I did not receive confirmation that the action succeeded." Both look identical from the model's perspective: a tool call that returned an error. Without additional context, retrying is the reasonable default. In this case, the reasonable default was wrong.¹⁷

The fix

The team added idempotency keys to all booking requests. Each booking intent gets one unique key, generated once and reused on every retry of that booking; if the API sees the same key twice, it returns the result of the first call instead of creating a duplicate. They also added a status-check tool: before retrying a booking, the agent first calls a "check booking status" endpoint to determine whether the previous attempt succeeded. Tool call results are cached so that timeout retries return the cached result rather than re-executing the action.
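The idempotency-key behavior can be illustrated with an in-memory stand-in for the booking service. `FakeBookingAPI`, `book_once`, and the `lost_responses` knob are hypothetical names invented for this sketch; the point is only that the key is created once per booking and reused on every retry.

```python
import uuid

class FakeBookingAPI:
    """In-memory stand-in for a booking service that honors idempotency keys."""

    def __init__(self, lost_responses=0):
        self._results = {}             # idempotency key -> confirmation
        self.reservations = []         # every reservation actually created
        self._lost = lost_responses    # how many confirmations to "lose"

    def book(self, key, hotel, dates):
        if key not in self._results:
            confirmation = {"id": f"res-{len(self.reservations) + 1}",
                            "hotel": hotel, "dates": dates}
            self.reservations.append(confirmation)
            self._results[key] = confirmation
        if self._lost > 0:
            # The booking was processed, but the response never arrives
            self._lost -= 1
            raise TimeoutError("confirmation not received")
        return self._results[key]

def book_once(api, hotel, dates, max_attempts=3):
    key = str(uuid.uuid4())  # one key per booking intent, reused on each retry
    for _ in range(max_attempts):
        try:
            return api.book(key, hotel, dates)
        except TimeoutError:
            continue
    raise RuntimeError("booking status unknown after retries")

api = FakeBookingAPI(lost_responses=2)
confirmation = book_once(api, "Hotel Example", ("2025-03-01", "2025-03-03"))
print(confirmation["id"], len(api.reservations))
```

Two confirmations are lost in this run, yet only one reservation exists afterwards, because the second and third attempts replay the stored result.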

. . .

Failure Type 4: Context Window Overflow

What the user saw

A document analysis system worked perfectly on short documents (under 10 pages) but produced incoherent output on long ones. The responses for long documents would start coherently, then lose focus, contradict earlier statements, and sometimes ignore the analysis instructions entirely.

What actually happened

The system prompt consumed 2,000 tokens. The analysis instructions added another 1,500. Tool definitions took 800. For a short document of 3,000 tokens, the total was well within the 128k context window. For a 200-page regulatory filing tokenizing to 150,000 tokens, the total came to about 154,300 tokens, exceeding the context window by roughly 20%.¹⁹

The API did not return an error. It silently truncated the input from the beginning, dropping the system prompt and the first portion of the analysis instructions. The model received the tail end of the document with no context about what it was supposed to do with it. It generated a completion, because that is what language models do, but the completion had no governing instructions.¹⁸

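def count_tokens(content):
    # Hypothetical stand-in: roughly 4 characters per token for English text.
    # Replace with the tokenizer that matches the deployed model.
    return len(str(content)) // 4

def allocate_context_budget(model_limit, system_prompt, tools, instructions):
    """Return the token budget left for document content, or raise if too small."""
    # Reserve space for output generation
    output_reserve = 4096

    # Calculate fixed overhead that is sent with every request
    fixed_tokens = (
        count_tokens(system_prompt)
        + count_tokens(tools)
        + count_tokens(instructions)
    )

    # Remaining budget for document content
    doc_budget = model_limit - fixed_tokens - output_reserve

    # Fail loudly rather than letting the API truncate silently
    if doc_budget < 1000:
        raise ValueError("Insufficient context budget for document content")

    return {
        "model_limit": model_limit,
        "fixed_overhead": fixed_tokens,
        "output_reserve": output_reserve,
        "document_budget": doc_budget,
    }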

The fix

The team implemented explicit token budget management. Before every request, the system calculates available capacity for document content after accounting for fixed overhead (system prompt, tools, instructions) and a reserved output buffer. Documents exceeding the budget are processed in chunks with a map-reduce strategy: each chunk is analyzed independently, then the partial results are synthesized in a final pass. The system now fails loudly when the budget is exceeded rather than silently truncating.
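The map-reduce strategy can be sketched as follows. As a loudly stated assumption, this example treats one whitespace-separated word as one token, which real tokenizers do not; `analyze_chunk` and `synthesize` are hypothetical callables standing in for model calls.

```python
def split_into_chunks(document, doc_budget):
    """Split on whitespace so each chunk holds at most doc_budget words.

    Assumption: one word counts as one token. A production version would
    measure chunks with the model's own tokenizer.
    """
    if doc_budget <= 0:
        raise ValueError("doc_budget must be positive")
    words = document.split()
    return [" ".join(words[i:i + doc_budget]) for i in range(0, len(words), doc_budget)]

def map_reduce(document, doc_budget, analyze_chunk, synthesize):
    chunks = split_into_chunks(document, doc_budget)
    partials = [analyze_chunk(chunk) for chunk in chunks]  # map: one pass per chunk
    return synthesize(partials)                            # reduce: combine partial results

# Stub callables: "analysis" is a word count, "synthesis" is a sum
document = " ".join(f"w{i}" for i in range(250))
chunks = split_into_chunks(document, doc_budget=100)
total_words = map_reduce(document, 100, lambda chunk: len(chunk.split()), sum)
print(len(chunks), total_words)
```

The synthesis pass is itself a request with a context limit, so the combined partial results need the same budget check as the original document.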

. . .

Failure Type 5: Evaluation Blind Spots

What the user saw

An internal dashboard showed 95% accuracy on the benchmark suite. User satisfaction surveys returned 60% approval. Leadership asked why the system was failing when the metrics said it was succeeding.

What actually happened

The benchmark tested factual accuracy: given a question with a known answer, did the model get it right? It did, 95% of the time. But users did not care only about factual accuracy. They cared about tone ("too formal"), helpfulness ("it answered the wrong question"), appropriate caveats ("it stated uncertain things as facts"), and format ("I needed a table, not a paragraph").⁵²⁰

The evaluation suite measured one dimension of quality. Users experienced a dozen dimensions. The gap between the two headline numbers, 95% benchmark accuracy and 60% user approval, was 35 percentage points. The benchmark was not wrong; it was incomplete. It measured what was easy to measure rather than what mattered.²¹

The fix

The team expanded their evaluation framework to include dimensions that mapped directly to user complaints: tone appropriateness, response format matching, confidence calibration, and task completion (did the user get what they needed, or did they have to rephrase?). They added a weekly sample of 50 responses reviewed by human evaluators against these dimensions. The benchmark still exists, but it is now one signal among several rather than the only signal that matters.²²
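One way to keep several signals side by side is to aggregate reviewer scores per dimension and flag any dimension that falls below a floor. The dimension names follow the list above; the 0/1 scoring scheme, the `floor` value, and the sample reviews are assumptions made up for illustration.

```python
from statistics import mean

DIMENSIONS = ("factual_accuracy", "tone", "format_match", "calibration", "task_completion")

def summarize_reviews(reviews, floor=0.8):
    """Average each dimension across reviews and list the ones below the floor."""
    summary = {}
    for dim in DIMENSIONS:
        scores = [review[dim] for review in reviews if dim in review]
        summary[dim] = mean(scores) if scores else None
    below = [dim for dim, score in summary.items() if score is not None and score < floor]
    return summary, below

# Hypothetical reviewer scores: 1 = acceptable, 0 = not acceptable
reviews = [
    {"factual_accuracy": 1, "tone": 0, "format_match": 1, "calibration": 1, "task_completion": 0},
    {"factual_accuracy": 1, "tone": 1, "format_match": 0, "calibration": 1, "task_completion": 1},
    {"factual_accuracy": 1, "tone": 0, "format_match": 1, "calibration": 0, "task_completion": 1},
    {"factual_accuracy": 1, "tone": 1, "format_match": 1, "calibration": 1, "task_completion": 0},
]
summary, below = summarize_reviews(reviews)
print(summary)
print("needs attention:", below)
```

In this made-up sample factual accuracy is perfect while every other dimension is flagged, which mirrors the gap between the benchmark and the survey.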

. . .

Failure Type 6: The Silent Degradation

What the user saw

Nothing, at first. No spike in error rates. No user complaints. Dashboards green across the board. But over eight weeks, the team noticed that users were asking more follow-up questions per session. Average session length increased 25%. Users were working harder to get the same results.⁶²³

What actually happened

An upstream data provider changed their API response format. Dates shifted from ISO 8601 to a locale-specific string. Currency values lost their currency code suffix. The ETL pipeline did not break; it ingested the new format without errors, because the fields were still strings of roughly the right length.

The embeddings computed over this subtly corrupted data were different from the original embeddings, but not dramatically so. Cosine similarity between old and new embeddings for the same concepts dropped from 0.92 to 0.78.²⁵ Retrieval still returned results. The results were just slightly less relevant, slightly less precise, slightly more likely to miss the best document and return the second-best one.²⁴²⁶

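import numpy as np
from scipy import stats

def detect_embedding_drift(baseline_embeddings, current_embeddings, threshold=0.05):
    """Compare row-aligned embeddings of the same reference documents.

    `threshold` is the p-value cutoff for the KS test, not a KS statistic.
    """
    baseline = np.asarray(baseline_embeddings, dtype=float)
    current = np.asarray(current_embeddings, dtype=float)
    if baseline.shape != current.shape:
        raise ValueError("baseline and current embeddings must be row-aligned")

    # Compare the distributions of vector norms with a two-sample KS test.
    # Caveat: for unit-normalized embeddings every norm is close to 1, so this
    # test carries little signal and the cosine check below matters more.
    baseline_norms = np.linalg.norm(baseline, axis=1)
    current_norms = np.linalg.norm(current, axis=1)
    ks_stat, p_value = stats.ks_2samp(baseline_norms, current_norms)

    # Mean cosine similarity between each document's old and new embedding
    dots = np.sum(baseline * current, axis=1)
    mean_sim = float(np.mean(dots / (baseline_norms * current_norms)))

    return {
        "drifted": bool(p_value < threshold),
        "ks_statistic": round(float(ks_stat), 4),
        "p_value": round(float(p_value), 6),
        "mean_cosine_similarity": round(mean_sim, 4),
    }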

The fix

The team implemented drift detection on embedding distributions. Every week, a batch job computes embeddings for a fixed set of reference documents and compares them against baseline embeddings computed at system launch. If a two-sample Kolmogorov-Smirnov test flags a significant distribution shift (a p-value below a chosen threshold), an alert fires. They also added periodic re-evaluation against a gold standard set of queries with known best answers, measuring retrieval precision at rank 1, 3, and 5.
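The gold-standard re-evaluation can be sketched as precision at k averaged over a set of queries. The `gold` mapping, the document ids, and the `stub_retrieve` function are hypothetical; a real job would call the production retriever.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def evaluate_gold_set(gold, retrieve, ks=(1, 3, 5)):
    """Mean precision at each k over all gold queries."""
    if not gold:
        raise ValueError("gold set is empty")
    return {
        k: sum(precision_at_k(retrieve(query), relevant, k)
               for query, relevant in gold.items()) / len(gold)
        for k in ks
    }

# Hypothetical gold set: query -> ids of the documents known to answer it
gold = {
    "refund policy": {"doc-7"},
    "shipping times": {"doc-2", "doc-9"},
}

def stub_retrieve(query):
    # Stand-in for the real retriever; returns ranked document ids
    ranked = {
        "refund policy": ["doc-7", "doc-1", "doc-3", "doc-4", "doc-5"],
        "shipping times": ["doc-4", "doc-2", "doc-9", "doc-1", "doc-8"],
    }
    return ranked[query]

scores = evaluate_gold_set(gold, stub_retrieve)
print(scores)
```

Tracking these numbers week over week gives a retrieval-quality signal that does not depend on users complaining first.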

. . .

The Post-Mortem Template

Every failure deserves a written post-mortem. Not because the document itself fixes anything, but because the act of writing it forces precision. "The model hallucinated" is not a root cause. "The chunking strategy split legal citations across boundaries, causing the model to reconstruct fictional citations from fragments" is a root cause. The template below enforces that level of specificity.

# LLM System Post-Mortem Template

incident:
  title: "Brief description of the failure"
  severity: "P1 | P2 | P3"
  date_detected: "2025-01-15"
  date_resolved: "2025-01-17"

symptom:
  user_impact: "What the user experienced"
  detection_method: "How it was discovered (alert | user report | audit)"
  affected_scope: "Percentage of users or requests affected"

timeline:
  - time: "When the failure likely started"
  - time: "When it was detected"
  - time: "When the fix was deployed"

root_cause:
  category: "prompt_drift | retrieval | tool_failure | context | eval | data_drift"
  what_changed: "The specific change that caused the failure"
  technical_detail: "Precise explanation of the failure mechanism"

detection_gap:
  why_not_caught: "Why existing tests and monitoring missed it"
  missing_signal: "What metric or test would have caught it"

resolution:
  immediate_fix: "What was done to stop the bleeding"
  permanent_fix: "What was done to prevent recurrence"
  new_tests_added: "What regression tests were created"
  monitoring_added: "What new alerts or dashboards were created"

Two properties of this template matter. First, it separates symptom from root cause. "The bot gave wrong answers" is a symptom. The root cause is always deeper: a data change, a model update, a missing validation step. Second, it requires you to explain the detection gap. Why didn't your tests catch this? That question, answered honestly, produces the most valuable insight in the entire document.²⁷

. . .

The Common Thread

All six failures share a structural property: they occur at component boundaries.²⁸³⁰ The prompt was fine in isolation. The retrieval system returned results. The model generated fluent text. The tools executed without errors. But the interaction between these components produced the failure.²⁹

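Prompt drift happens at the boundary between model versions and prompt design. Retrieval poisoning happens at the boundary between data ingestion and chunk construction. Tool cascade failures happen at the boundary between API behavior and the model's understanding of that behavior. Context overflow happens at the boundary between token accounting and API truncation behavior. Evaluation blind spots happen at the boundary between what the benchmark measures and what users experience. Silent degradation happens at the boundary between an upstream data format and the ingestion pipeline that feeds the embeddings.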

Testing components in isolation is necessary. It is not sufficient. The failures that reach production, the ones that cost money and erode trust, live in the spaces between components. Integration testing, end-to-end monitoring, and structured feedback from real users are the instruments that detect these boundary failures.

Traditional software engineering learned this lesson decades ago with integration testing and contract testing. LLM systems need the same discipline, adapted for probabilistic behavior and subjective quality. The post-mortem is one tool for building that discipline. It works because it forces you to look at the whole system, not just the part that seems broken.

. . .

References

1. Anthropic. "Model Card and Evaluations for Claude Models." Anthropic Research, 2024.
2. Chen, L., et al. "How is ChatGPT's Behavior Changing over Time?" arXiv, 2023.
3. Barnett, S., et al. "Seven Failure Points When Engineering a Retrieval Augmented Generation System." arXiv, 2024.
4. Renze, M. & Guven, E. "The Effect of Sampling Temperature on Problem Solving in Large Language Models." arXiv, 2024.
5. Shankar, S., et al. "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv, 2024.
6. Paleyes, A., Urma, R.-G., & Lawrence, N. "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 2022.
7. Khattab, O., et al. "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." arXiv, 2023.
