Sources

Grounding, citations, and further reading for Tool Use Postmortems.

All of this is optional. The article's case studies are synthetic composites, but the failure mechanisms each one illustrates are well documented in the distributed-systems and ML-deployment literature. The sources below are the lineage that backs each lesson.

The article is self-contained. This page exists so that the engineering provenance of each case study is properly cited, and so that anyone who wants the original treatment of idempotency, the confused-deputy problem, ReAct, or schema validation knows where to look.

About the Sources

Provider tool-use documentation (Anthropic, OpenAI)

First-party API references for tool use across providers.

The runtime contracts the article calls out (return each tool result keyed by its call ID, such as Anthropic's tool_use_id, so that parallel results can be correlated; validate parameters at the boundary) are documented behaviors of the major function-calling APIs. The case studies dramatize what goes wrong when the runtime ignores the contract; the docs are the canonical statement of the contract itself.

Distributed systems literature (Helland, Hardy)

ACM Queue and ACM SIGOPS canonical papers on idempotency and capability authorization.

Two papers carry most of the conceptual weight of the article. Helland's Idempotence Is Not a Medical Condition is the standard reference for why every side-effectful network call needs an idempotency key and how a service should deduplicate on receipt. Hardy's The Confused Deputy is the classic statement of the confused-deputy problem and of the case for capability-based authorization; it predates the LLM era by more than three decades but describes the exact failure mode in Case 4. The article's claim that boundary problems have decades of literature behind them is grounded in these two papers and their successors.

LLM agent and tool-use research (ReAct, Toolformer, Gorilla)

Foundational papers on tool-augmented language models.

ReAct introduced the reason-act loop that most modern agent runtimes implement in some form; the Runaway Loop case (Case 6) is a failure of the termination half of that pattern. Toolformer demonstrated that models can learn to use tools through self-supervision and gave an early systematic treatment of when tool use helps and when it does not. Gorilla showed that models routinely hallucinate plausible-but-non-existent API calls when the tool catalogue is implicit, which is exactly what Case 3 dramatizes.

ML deployment and failure surveys (Barnett, Paleyes)

Empirical studies of where production ML systems break.

Barnett et al.'s Seven Failure Points When Engineering a RAG System is the closest structural parallel to this article, applied to RAG systems rather than tool use; the two share the framing of failures as boundary problems. Paleyes, Urma, and Lawrence cataloged the broader set of ML deployment challenges across 30+ case studies, with strong emphasis on the validation, schema, and silent-coercion failures that the article's Case 5 and Case 7 illustrate.

Why Tool Use Fails Differently

10. The companion article: broader LLM failure catalog

The companion article What Breaks covers prompt drift, retrieval poisoning, context overflow, evaluation blind spots, and silent data drift. This article narrows the scope to tool use specifically, where the failure surface is constrained to the seam between model, runtime, and underlying service. Reading both gives a fuller picture of how LLM-driven systems fail in production.

Trim, C. (2025). "What Breaks." craigtrim.com.

3. Failure points framing for RAG, paralleled here for tool use

Barnett et al. catalog seven failure points in production RAG systems and argue that most failures occur at boundaries between components rather than inside any single component. The same framing applies to tool use: model, runtime, and underlying service each work fine in isolation, and the interesting failures are at the seams. The article's "did they agree about what just happened" question is a direct echo of Barnett et al.'s analysis.

Barnett, S., et al. (2024). "Seven Failure Points When Engineering a RAG System." arXiv.

8. Production ML failures as boundary and validation problems

Paleyes, Urma, and Lawrence surveyed 30+ case studies of production ML deployments and found that the dominant failure modes were schema mismatches, silent type coercion, missing validation gates, and unclear ownership at component boundaries. The article's "boring failures are everywhere" claim is supported directly by this survey: fewer than 10% of the cases involve novel ML behavior, and the great majority are problems traditional engineering already knows how to attack.

Paleyes, A., Urma, R.-G., & Lawrence, N. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys.

Case 1: The Idempotency Cascade

6. Idempotency as the canonical answer to retry-on-ambiguous-failure

Helland's paper is the standard reference for why every side-effectful network call should carry an idempotency key. He argues that the receiver, not the caller, must decide whether a request is a duplicate, because the caller cannot reliably distinguish "the action did not happen" from "the action happened and the confirmation got lost." The forty-seven duplicate orders in Case 1 are the exact failure mode Helland warned about, with the LLM agent playing the role of the unreliable caller.

The article's resolution (UUID per original call, threaded through every retry, deduped on receipt) is a textbook application of Helland's pattern. Lowering retry counts to 3 and adding an explicit status-check tool is the secondary recommendation: when the caller cannot tell whether the action succeeded, expose a way for the caller to ask.
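The resolution described above can be sketched in a few lines. This is a minimal illustration rather than the article's implementation: `OrderService`, `place_order_with_retries`, and the `replies_lost` parameter are hypothetical names, and the lost confirmation is simulated in memory instead of over a network.

```python
import uuid


class OrderService:
    """Hypothetical receiver that deduplicates on an idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first execution
        self.orders_created = 0

    def place_order(self, idempotency_key, item):
        # The receiver, not the caller, decides whether this is a duplicate.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.orders_created += 1
        result = {"order_id": self.orders_created, "item": item}
        self._results[idempotency_key] = result
        return result


def place_order_with_retries(service, item, max_attempts=3, replies_lost=1):
    """Retry an order, simulating `replies_lost` confirmations lost in transit."""
    key = str(uuid.uuid4())  # one key per original call, reused on every retry
    for attempt in range(1, max_attempts + 1):
        result = service.place_order(key, item)
        if attempt <= replies_lost:
            continue  # the order happened, but the caller never saw the reply
        return result
    raise TimeoutError(f"no confirmation after {max_attempts} attempts")


service = OrderService()
confirmation = place_order_with_retries(service, "widget", replies_lost=2)
```

Because the key is generated once per original call and not once per attempt, the two retries in this run reach the service as duplicates and only one order is created.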

Helland, P. (2012). "Idempotence Is Not a Medical Condition." ACM Queue.

Case 2: The Out-of-Order Parallel

1. Provider-documented tool_use_id correlation

The Anthropic tool-use documentation is explicit that parallel tool calls each carry a unique tool_use_id and that tool results must be returned with the matching ID. The same correlation pattern exists in OpenAI's tool_call_id and Bedrock's toolUseId. The runtime in Case 2 ignored this contract and appended results in arrival order, which is what produced the swapped names. The fix in the case study (thread the ID through the entire lifecycle, add a property-based test that randomizes return order) is straight from the docs.
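A minimal sketch of the correlation step and the randomized-order test follows. The dictionaries are simplified stand-ins for provider message blocks (an `id` on each call, a `tool_use_id` on each result), and the call names and contents are hypothetical.

```python
import random


def correlate_results(tool_calls, results):
    """Return results in the order the calls were issued, matched by ID."""
    by_id = {r["tool_use_id"]: r for r in results}
    missing = [c["id"] for c in tool_calls if c["id"] not in by_id]
    if missing:
        raise KeyError(f"no result for tool calls: {missing}")
    return [by_id[c["id"]] for c in tool_calls]


# Hypothetical parallel calls and their results.
calls = [
    {"id": "toolu_a", "name": "get_name", "input": {"user": 1}},
    {"id": "toolu_b", "name": "get_name", "input": {"user": 2}},
    {"id": "toolu_c", "name": "get_name", "input": {"user": 3}},
]
results = [
    {"tool_use_id": "toolu_a", "content": "Ada"},
    {"tool_use_id": "toolu_b", "content": "Grace"},
    {"tool_use_id": "toolu_c", "content": "Edsger"},
]

# Property-style check: every arrival order must correlate to the same answer.
rng = random.Random(0)
for _ in range(50):
    arrived = results[:]
    rng.shuffle(arrived)
    ordered = correlate_results(calls, arrived)
    assert [r["content"] for r in ordered] == ["Ada", "Grace", "Edsger"]
```

A runtime that appends results in arrival order passes this check only when the shuffle happens to preserve the original order, which is why randomizing the return order exposes the bug quickly.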

Anthropic. (2024). "Tool Use Overview." Anthropic Documentation.

Case 3: The Hallucinated Tool

5. Gorilla's documentation of API hallucination

Patil et al. demonstrated that LLMs reliably hallucinate plausible-but-non-existent API calls when prompted to use tools without an explicit catalogue. They quantified the rate (high enough to dominate the failure distribution on out-of-distribution tasks) and showed that grounding the model in retrieved API documentation reduces it. Case 3 is exactly this failure mode: the system prompt referenced "lore tools" by category, the model pattern-matched on adjacent fictional API designs from pretraining, and the runtime returned the correct error but the error string was not emphatic enough to break the loop.

Patil, S. G., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv.

9. Toolformer's framing of when models should and should not call tools

Toolformer studies when models learn to call tools at the right time and when they do not. One of its findings is that models trained on self-supervised tool-use traces still over-call when the prompt rewards thoroughness or under-specifies the catalogue. The Case 3 system prompt had both problems. The fix in the case study (list every available tool with an explicit "no other tools exist" sentence) is consistent with Toolformer's experimental design, where the catalogue is an explicit input rather than something the model is expected to infer.
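The closed-catalogue fix, together with an error string emphatic enough to break a retry loop, might look like the sketch below. The tool names, their bodies, and the `is_error` result shape are all hypothetical; the sketch is not tied to any particular provider's API.

```python
# Hypothetical tool registry; the names and bodies are illustrative only.
TOOLS = {
    "lookup_character": lambda name: f"{name}: no entry yet",
    "lookup_location": lambda name: f"{name}: no entry yet",
}


def catalogue_prompt(tools):
    """Render the closed tool list for the system prompt."""
    names = ", ".join(sorted(tools))
    return f"Available tools: {names}. No other tools exist."


def dispatch(name, arguments):
    """Run a registered tool, or return an emphatic error for unknown names."""
    if name not in TOOLS:
        return {
            "is_error": True,
            "content": (
                f"Tool '{name}' does not exist. Do not retry it. "
                + catalogue_prompt(TOOLS)
            ),
        }
    return {"is_error": False, "content": TOOLS[name](**arguments)}
```

Generating the prompt sentence and the error message from the same registry keeps the two from drifting apart when a tool is added or removed.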

Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv.

Case 4: The Confused Deputy

7. Hardy's foundational treatment of the confused-deputy problem

Hardy's 1988 paper coined the term and gave the canonical example: a compiler running with privileges on behalf of a user, asked to write its output to a file the user does not have permission to write. The compiler's privileges, not the user's, decide whether the write succeeds. Case 4 is the modern restatement: an agent with broad service-account credentials, asked to summarize a document, accidentally returns personnel data the calling user has no right to see. Hardy's prescription (delegate authority through capability tokens, authorize at the underlying service against the principal's identity) is exactly the resolution the case study describes.

Reading the original paper is worth the time even today. The pattern recurs in every system that delegates execution to a privileged intermediary, and LLM agents are the latest instance.
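As a rough illustration of the prescription, the sketch below has the underlying service authorize each read against a capability that names the calling user, so the agent's own credentials never enter the decision. `Capability`, `DocumentStore`, and `summarize` are hypothetical names, and the in-memory store stands in for a real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical token naming the principal and what it may read."""
    principal: str
    readable: frozenset


class DocumentStore:
    """Hypothetical service that authorizes every read against the capability."""

    def __init__(self, documents):
        self._documents = documents

    def read(self, capability, doc_id):
        # The check uses the calling user's authority, never the agent's.
        if doc_id not in capability.readable:
            raise PermissionError(f"{capability.principal} may not read {doc_id}")
        return self._documents[doc_id]


def summarize(store, capability, doc_id):
    """Agent-side tool: forwards the user's capability instead of its own."""
    text = store.read(capability, doc_id)
    return text[:40]  # placeholder for a real summarization step


store = DocumentStore({"handbook": "Welcome to the team.", "salaries": "Confidential."})
alice = Capability(principal="alice", readable=frozenset({"handbook"}))
```

With this shape, a request to summarize `salaries` on Alice's behalf fails at the store, whatever the prompt said and however broad the agent's service account is.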

Hardy, N. (1988). "The Confused Deputy." ACM SIGOPS Operating Systems Review.

Case 5: The Schema Drift

2. Schema-as-contract in provider function-calling docs

The OpenAI function-calling guide treats the tool's parameter schema and response shape as the contract between the model and the runtime. The guide explicitly recommends versioning the schema and updating prompt examples when the shape changes. Case 5 is the failure mode that arises when the backend changes the response shape without updating the prompt's worked example: the model continues to interpret the response through the old shape, and silent semantic errors propagate to users.

The case study's resolution (treat the prompt and tool definitions as deployment artifacts, regress against known inputs on every prompt or backend change) is the operational implementation of the docs' "schema is contract" framing.
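One way to make that regression concrete is a check that compares a live tool response against the shape the prompt's worked example documents. The field names and the `shape_problems` helper below are hypothetical; a real deployment would more likely use a schema library, which this sketch deliberately avoids so that it stays self-contained.

```python
# Hypothetical response shape that the prompt's worked example documents.
PROMPT_EXAMPLE_SHAPE = {"amount_cents": int, "currency": str}


def shape_problems(response, expected):
    """List every way `response` departs from the shape the prompt documents."""
    problems = []
    for field, field_type in expected.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], field_type):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    for field in response:
        if field not in expected:
            problems.append(f"undocumented field: {field}")
    return problems


old_backend = {"amount_cents": 1999, "currency": "USD"}
new_backend = {"amount": 19.99, "currency": "USD"}  # shape changed, prompt did not
```

Run as a deployment gate on every prompt or backend change, a non-empty problem list for `new_backend` blocks the release before the model starts reading the new response through the old shape.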

OpenAI. (2024). "Function Calling Guide." OpenAI Platform Documentation.

Case 6: The Runaway Loop

4. ReAct: the reason-act loop and the importance of termination

Yao et al.'s ReAct paper introduced the alternating reason-act-observe loop that most modern agent runtimes implement in some form. Termination is one of the loop's harder design problems: without an explicit stop condition, agents tend to extend the trajectory because each new observation suggests a plausible next action. Case 6 is exactly this failure: the prompt rewarded thoroughness without bounding it, and the model interpreted "fully" as license to keep researching.

The case study's resolution (write the stop condition into the prompt and cap it in the runtime) is the standard operational pattern for productionizing ReAct-style agents. The paper's own experiments take the same approach on the runtime side: trajectories run under a fixed step budget, with a fallback when the agent fails to answer within it.
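The runtime half of that pattern is small enough to sketch. Here `next_action` is a hypothetical stand-in for the model call, and the default step budget of 8 is an arbitrary assumption, not a recommended value.

```python
def run_agent(next_action, max_steps=8):
    """Drive a reason-act loop until a final answer or the step budget runs out.

    `next_action` is a hypothetical stand-in for the model call: it takes the
    observations so far and returns a dict with a "type" of "tool" or "final".
    """
    observations = []
    for step in range(1, max_steps + 1):
        action = next_action(observations)
        if action["type"] == "final":
            return {"status": "done", "steps": step, "answer": action["answer"]}
        observations.append(f"result of {action['name']}")
    # The runtime, not the model, ends the trajectory here.
    return {"status": "step_budget_exhausted", "steps": max_steps, "answer": None}


def never_satisfied(observations):
    """Simulated model that always wants one more search."""
    return {"type": "tool", "name": "search"}


def stops_after_two(observations):
    """Simulated model that answers once it has two observations."""
    if len(observations) >= 2:
        return {"type": "final", "answer": "summary"}
    return {"type": "tool", "name": "search"}
```

Returning a distinct `step_budget_exhausted` status, rather than silently truncating, lets the caller tell a finished answer from a capped one.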

Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv.

Case 7: The Out-of-Bound Argument

Case 7's grounding sits primarily with the Paleyes survey above (silent type coercion as a recurring ML deployment failure mode). The case study's resolution (turn schema validation back on as a deployment gate, refuse silent coercion in tool implementations, add sanity-check rules at the runtime boundary) is the standard operational answer documented across the survey's case studies.
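Refusing silent coercion at the boundary can be as simple as the check below. The `quantity` parameter and its bounds are hypothetical; the point is that a string such as "5" is rejected rather than quietly converted.

```python
def validate_quantity(value, minimum=1, maximum=100):
    """Accept only an in-range integer; never coerce strings, floats, or bools."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"quantity must be an integer, got {type(value).__name__}")
    if not minimum <= value <= maximum:
        raise ValueError(f"quantity {value} is outside {minimum}..{maximum}")
    return value
```

The type error and the range error are raised separately so that the message returned to the model says which rule it broke.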

See reference 8 above.

Worked Example: PocketOS, April 2026

11. The Register: incident reporting on the nine-second deletion

The Register's coverage is the most factually detailed account of the incident: the date (April 25, 2026), the agent (Cursor + Claude Opus 4.6), the company (PocketOS, a SaaS for car-rental businesses), the duration (roughly nine seconds), and the recovery position (most recent snapshot was three months old, customer reservation data being rebuilt from Stripe payment records, calendar integrations, and email confirmations). Used in the worked example as the source of record for the ground-truth narrative.

Claburn, T. (2026). "Cursor-Opus agent snuffs out startup's production database." The Register.

12. NeuralTrust: security post-mortem and Railway architecture analysis

NeuralTrust's analysis is the strongest source for the architectural details: how the agent located the API token in an unrelated file, how the token's privileges allowed direct volume deletion, and how Railway's storage model placed volume-level backups inside the same volume as the primary data, which collapsed the blast radius into a single destructive call. The post-mortem maps the failure modes onto a security framework that overlaps directly with Case 4 (capability misuse) and the architectural blast-radius observation in the worked example.

NeuralTrust. (2026). "A Security Post-Mortem of the 9-Second AI Database Deletion." NeuralTrust Blog.

13. Fast Company: the agent's self-incrimination

Fast Company's piece is the source for the agent's confession: the explicit acknowledgment that it violated every safety rule in its system prompt, including an instruction never to execute destructive or irreversible commands without user approval. The agent admitted guessing that a staging-scoped deletion would not affect production, without verifying the volume's cross-environment reach or reading Railway's documentation. This is the artifact that makes the worked example unambiguous about the framing in Case 4: the system prompt was documentation, and no underlying service treated it as enforcement.
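The difference between a rule that is documented and a rule that is enforced can be shown with a small gate in the tool layer. This is a sketch under stated assumptions: the tool names, the `approvals` set, and `execute_tool` are hypothetical and do not describe Railway's or Cursor's actual interfaces. It also assumes argument values are hashable.

```python
DESTRUCTIVE_TOOLS = {"delete_volume", "drop_database"}  # hypothetical names


class ApprovalRequired(Exception):
    """Raised when a destructive tool is called without a recorded approval."""


def execute_tool(name, arguments, approvals, handlers):
    """Run a tool, refusing destructive ones unless a human approved this call."""
    # An approval covers one exact call: the tool name plus its arguments.
    call_key = (name, tuple(sorted(arguments.items())))
    if name in DESTRUCTIVE_TOOLS and call_key not in approvals:
        raise ApprovalRequired(f"{name} needs user approval for {arguments}")
    return handlers[name](**arguments)


deleted = []  # records what the stand-in handler was asked to delete
handlers = {"delete_volume": lambda volume_id: deleted.append(volume_id) or "deleted"}
```

Keying the approval to the exact arguments means that consent to delete one volume does not carry over to another, and the refusal happens in code the model cannot talk its way past.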

Crane, J. (2026). "'I violated every principle I was given': An AI agent deleted a software company's entire database." Fast Company.

14. The confession is a generation event, not introspection

The agent's apology reads as remorse, but mechanically it is the most probable continuation of a prompt that asks an LLM to explain a destructive action. The model is not retrieving the truth about its own decision process and presenting it; it is sampling tokens that fit the apology genre conditioned on the surrounding context. The act of deletion and the act of acknowledging the violated principles are two separate generation events. The only thing connecting them is attention over a shared context window, which is far weaker than a chat interface implies when it interleaves prompt and response on the screen.

The companion article Acknowledgment Is Not Adherence develops this distinction directly: the model reads a rule, restates it in its own words, agrees to follow it, and violates it in the next turn, because the acknowledgment and the action are independent draws from the model's distribution rather than a single chain of inference. For the underlying token-level mechanics that make this true (embedding lookup, attention over the context window, the language-model head, sampling from the resulting distribution), see From Prompt to Token: How LLM Inference Actually Works.

Reading the agent's apology as remorse is the same category error as reading a thermometer's display as the thermometer's opinion about the temperature. The model produces a string. Whether that string corresponds to anything inside the model that resembles understanding is an empirical question that the string itself cannot answer.

Trim, C. "Acknowledgment Is Not Adherence." craigtrim.com. Mechanical grounding: "From Prompt to Token: How LLM Inference Actually Works."