Tool Use Postmortems
Tool use rarely fails the way the headlines suggest. The failures that wake people up are not jailbreaks or remote code execution but mundane operational glitches at the boundaries between the model, the runtime, and the API.
Why Tool Use Fails Differently
The companion article What Breaks catalogs LLM failures broadly: prompt drift, retrieval poisoning, context overflow, evaluation blind spots, silent data drift. Tool use is a narrower problem with a sharper edge. Every tool-use failure happens at the seam between three components that each work fine in isolation: the model that decides what to call, the runtime that executes the call, and the underlying service that does the actual work. The interesting question is never "did one of the three break" but "did they agree about what just happened."[10]
This is why the failure modes look so dull. They are the same problems distributed systems engineers have argued about for decades. Was the request idempotent. Did the retries replay an effect that already succeeded. Did the parallel results come back in the order we expected. Was the schema we documented the schema we returned. Did we authorize the user or the agent acting on the user's behalf. The model contributes a new wrinkle, which is that it cannot tell the difference between a real error and a confusing success, but the underlying problems are not new.[3][8]
What follows is a sober catalog of seven such incidents. Read them in order or jump to the one that looks most like the bug you are currently chasing. Each case study uses the same five-part structure as the broader postmortem template, with one addition at the end: a single-sentence lesson.
Case 1: The Idempotency Cascade
Incident
An agentic procurement assistant managing supplier orders for a regional restaurant chain placed forty-seven duplicate orders for the same shipment of frozen produce to a distribution center. The credit ledger showed a single requested purchase from the user; the supplier system showed forty-seven confirmed line items, each with a distinct order number and each one billed.
Symptom
From the agent's transcript, every tool call appeared to fail. The supplier API was returning HTTP 500 errors, the agent reported the failure in its scratchpad, and the agent retried. The transcript shows the model becoming increasingly verbose with each attempt, narrating its reasoning about why the supplier might be having trouble and proposing alternative phrasings of the request body in case the issue was a parser quirk on the other side.
Root Cause
The supplier API was succeeding. Each request created a real order. The 500 was being returned by an upstream load balancer that timed out waiting for the confirmation response, which the supplier was rendering through a slow downstream pricing service. The order itself committed in roughly 80 milliseconds. The pricing decoration took several seconds and occasionally exceeded the 30-second balancer timeout. The agent saw a 500, treated the call as a clean failure, and retried.
Detection Gap
No alert fired because every tool call was, from the runtime's perspective, a normal failure followed by a normal retry. The retry budget per tool was 50, set high to handle the supplier's documented intermittent flakiness. Forty-seven attempts fit comfortably under that ceiling. The downstream charges only surfaced two days later, when accounts payable noticed the volume.
Resolution
The team added an idempotency key to every order request. The runtime generates a UUID for the original tool call and threads the same key through every retry. The supplier deduplicates on receipt. They also added a status-check tool that the model can invoke before retrying any side-effectful call, with a clear instruction in the system prompt to use it whenever a previous order request returned an ambiguous error. Retry counts were dropped from 50 to 3, with budget spent on the status check rather than blind reattempts.
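The resolution above can be sketched in a few lines. This is a minimal illustration, not the team's actual runtime: FakeSupplier, place_order_safely, and the order_status method are hypothetical names, and the fake supplier is rigged to commit the order and then raise a timeout, which is the failure shape described in the root cause.

```python
import uuid


class FakeSupplier:
    """Hypothetical stand-in for the supplier API: commits, then 'times out'."""

    def __init__(self):
        self.orders = {}  # idempotency key -> order id

    def place_order(self, key, payload):
        # Deduplicate on receipt: a repeated key never creates a second order.
        if key not in self.orders:
            self.orders[key] = f"order-{len(self.orders) + 1}"
        # Simulate the balancer timing out after the order has committed.
        raise TimeoutError("HTTP 500 from load balancer")

    def order_status(self, key):
        # Status check: the order id if this key was seen, else None.
        return self.orders.get(key)


def place_order_safely(supplier, payload, max_retries=3):
    # One key per logical tool call, reused on every retry.
    key = str(uuid.uuid4())
    for _ in range(max_retries):
        try:
            return supplier.place_order(key, payload)
        except TimeoutError:
            # Ambiguous error: ask whether the order exists before retrying.
            existing = supplier.order_status(key)
            if existing is not None:
                return existing
    raise RuntimeError("order not confirmed after retries")
```

With this shape, the first ambiguous 500 is resolved by the status check rather than by a second order, and even a blind retry with the same key would be deduplicated by the supplier.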
Lesson
Treat every side-effectful tool call as if it might already have succeeded, and pass an idempotency key that lets the underlying service confirm whether it has.[6]
Case 2: The Out-of-Order Parallel
Incident
A customer-support agent assembling account profile cards for a service dashboard issued six parallel calls to lookup_customer, asking for six account holders in a single review batch. The output rendered the first account's billing history under the second account's name, the fourth account's open tickets under the third account's profile, and the fifth account's premium status against the sixth account's record.
Symptom
The records were each individually accurate. Each tool call had returned correct data for one of the six accounts. The records were simply attached to the wrong names.
Root Cause
The runtime fanned the six calls out concurrently and collected the responses as they arrived. The customer database had per-record latency variance, and the responses returned in a different order than the calls were issued. The runtime appended the results to the message in arrival order without preserving the original call IDs. The model received six tool results in a list and assumed positional correspondence with the six account identifiers it had passed in.
Detection Gap
Unit tests for the runtime exercised parallel calls but used a mock that returned results in submission order. Integration tests against the live database ran with a single account at a time. There was no test that exercised the actual concurrency path with realistic latency variance.
Resolution
The runtime was rewritten to thread the provider's tool_use_id through the entire call lifecycle. Every result message now carries the ID of the call it answers. The model now receives results that are correlated to the calls by ID rather than by position, so order of arrival is irrelevant. A property-based test was added that randomizes response order across hundreds of synthesized parallel batches.
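A rough sketch of ID-based correlation, assuming an asyncio-based runtime. The lookup_customer stub, the run_call and run_parallel helpers, and the ID values are invented for illustration; only the idea of carrying tool_use_id on every result comes from the text above.

```python
import asyncio
import random


async def lookup_customer(account_id):
    # Hypothetical tool: random latency makes completion order unpredictable.
    await asyncio.sleep(random.uniform(0, 0.02))
    return {"account_id": account_id, "name": f"Customer {account_id}"}


async def run_call(tool_use_id, account_id):
    result = await lookup_customer(account_id)
    # Every result message carries the ID of the call it answers.
    return {"tool_use_id": tool_use_id, "content": result}


async def run_parallel(calls):
    """calls maps tool_use_id -> account_id; returns tool_use_id -> result."""
    tasks = [
        asyncio.create_task(run_call(tool_use_id, account_id))
        for tool_use_id, account_id in calls.items()
    ]
    results = {}
    # Results are collected in arrival order but stored by ID,
    # so arrival order never affects which call a result belongs to.
    for finished in asyncio.as_completed(tasks):
        message = await finished
        results[message["tool_use_id"]] = message["content"]
    return results
```

Because the mapping is keyed by ID, a test can shuffle latencies freely and still assert that each result answers the call that requested it.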
Lesson
Parallel tool results must be correlated to their calls by ID, not by position, because the runtime cannot guarantee ordering and the model cannot detect the mismatch.[1]
Case 3: The Hallucinated Tool
Incident
A digital publisher running an agentic content generator for product reviews was producing articles with strange recurring sequences in which the system would pause to "invoke a vendor pricing service for tier validation." The site logs showed thousands of tool-call attempts to a function named fetch_vendor_pricing_v2, none of which existed in the runtime registry.
Symptom
The model invoked fetch_vendor_pricing_v2, the runtime returned an error indicating the tool was not registered, the model treated the error as transient and retried with slight variations to the arguments, and the loop continued until the per-turn tool budget exhausted. The user-facing output was eventually emitted as text, but the article ended abruptly mid-paragraph.
Root Cause
The system prompt referenced "catalog tools" by category rather than by exact name, hoping the model would discover the right tool from the schema. The actual registered tool was lookup_product. Under certain article prompts, the model would pattern-match on adjacent commerce API designs from its pretraining data and emit a plausible-looking call to a function that did not exist. The runtime returned the correct error: tool not registered. The error string did not say so emphatically enough for the model to give up.
Detection Gap
Tool budget monitoring existed but only fired alerts when a single conversation exhausted the global budget. A pattern of conversations each consuming 90 percent of budget on retries to a non-existent tool flew under the threshold. There was no metric tracking the rate of tool-not-registered errors over time, which would have shown a rising trend in the days before the obvious failure.
Resolution
The team did three things. First, they listed every available tool in the system prompt with an explicit "no other tools exist" sentence. Second, they reshaped the runtime error to read "FATAL: tool 'fetch_vendor_pricing_v2' is not registered. Do not retry. Use only the tools listed in your tool definitions." Third, they instrumented the runtime to count tool-not-registered errors per minute and to abort any conversation that triggered more than three in a row.
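The second and third changes might look something like the following. This is a sketch under assumptions: ToolDispatcher, ConversationAborted, and the registry contents are hypothetical, and the per-minute metric is left out. The error wording is the one quoted above, and "more than three in a row" is read as aborting on the fourth consecutive unknown-tool call.

```python
# Hypothetical registry with the one real tool named in the text.
REGISTRY = {"lookup_product": lambda sku: {"sku": sku, "in_stock": True}}
MAX_CONSECUTIVE_UNKNOWN = 3


class ConversationAborted(Exception):
    """Raised when the model keeps calling tools that do not exist."""


class ToolDispatcher:
    def __init__(self, registry):
        self.registry = registry
        self.consecutive_unknown = 0

    def dispatch(self, name, **kwargs):
        if name not in self.registry:
            self.consecutive_unknown += 1
            if self.consecutive_unknown > MAX_CONSECUTIVE_UNKNOWN:
                raise ConversationAborted(f"repeated calls to unknown tool '{name}'")
            # The error text is written for the model, not for a log reader.
            return {
                "is_error": True,
                "content": (
                    f"FATAL: tool '{name}' is not registered. Do not retry. "
                    "Use only the tools listed in your tool definitions."
                ),
            }
        # Any successful dispatch resets the streak.
        self.consecutive_unknown = 0
        return {"is_error": False, "content": self.registry[name](**kwargs)}
```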
Lesson
When the runtime rejects a tool the model invented, the rejection message is part of the prompt loop and must be written like the model is the audience.[5][9]
Case 4: The Confused Deputy
Incident
An employee directory assistant exposed to general users at a corporate intranet returned a passage describing the home address and salary band of a senior executive. The user who asked the question was a junior employee with no authorization to view personnel records.
Symptom
The query was innocuous: a request to summarize the contents of a public company bulletin from the previous week. The agent's response correctly summarized the bulletin and then, in an apparent flourish, included a paragraph identifying which executive had authored each section, with biographical detail no public document would have surfaced.
Root Cause
The agent's lookup_personnel tool ran with service-account credentials that had read access to the entire personnel directory. The tool was nominally restricted to "look up the author of a publicly attributed document," but the restriction lived in the tool description, not in the underlying authorization. The model used the tool exactly as described, but the tool itself returned full personnel records and the model summarized whatever fields it received.
Detection Gap
The privilege boundary was documented in the tool description and assumed to be enforced by the model's adherence to that description. There was no policy enforcement at the directory layer. The agent's audit log showed legitimate-looking tool calls with parameters within the documented contract.
Resolution
The team adopted on-behalf-of credentials. The agent now receives a short-lived token scoped to the calling user's permissions, and every tool call propagates that token to the underlying service. The personnel directory enforces row-level access against the propagated identity. The agent retains its own identity for telemetry purposes only. Tool descriptions stopped pretending to enforce privilege and started describing capability, and the directory does the actual enforcing.
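To make the shift concrete, here is a small sketch of enforcement living in the directory rather than in the tool description. Everything in it is assumed for illustration: the UserToken shape, the "personnel:read" scope name, the sample records, and the rule that users may read their own row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserToken:
    """Short-lived token scoped to the calling user (hypothetical shape)."""
    user_id: str
    scopes: frozenset = frozenset()


PERSONNEL = {
    "emp-001": {"name": "A. Executive", "salary_band": "E9"},
    "emp-002": {"name": "J. Junior", "salary_band": "J2"},
}


def directory_read(token, person_id):
    """Directory-side check: row-level access against the propagated identity."""
    is_self = token.user_id == person_id
    is_hr = "personnel:read" in token.scopes
    if not (is_self or is_hr):
        raise PermissionError(f"{token.user_id} may not read {person_id}")
    return dict(PERSONNEL[person_id])


def lookup_personnel_tool(token, person_id):
    """Tool wrapper: forwards the user's token and holds no elevated credentials."""
    try:
        return {"is_error": False, "content": directory_read(token, person_id)}
    except PermissionError as exc:
        return {"is_error": True, "content": str(exc)}
```

The tool wrapper has nothing to misuse: whatever the model asks for, the directory answers according to who the user is.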
Lesson
A tool description is documentation, not access control; privilege must be enforced at the underlying service against the user's identity, not the agent's.[7]
Case 5: The Schema Drift
Incident
A podcast-archive agent for a corporate media library began producing show summaries with curiously incorrect speaker attributions. The summary of a panel episode attributed the host's opening remarks to the guest and the guest's responses to the host. The actual roles were the other way round.
Symptom
Every summary produced after a backend deployment three weeks earlier had the same structural error: primary speaker and secondary speaker were swapped. Older summaries in the archive were correct.
Root Cause
The lookup_episode tool's response shape changed. The previous shape was {"primary": "...", "secondary": "..."}. The new shape was {"speakers": ["primary", "secondary"]}, an array ordered by billing. The system prompt, however, included a worked example showing how to interpret the old shape, and the example explicitly named the field secondary. The model defaulted to the example's framing whenever the field name was missing, which now meant treating array index zero as secondary and index one as primary.
Detection Gap
Backend tests verified that the new response shape was structurally valid JSON and that downstream consumers parsed it. None of the consumers were the LLM-driven agent, because the LLM was not viewed as a structured consumer. The agent's evaluation suite measured summary fluency and length but did not include a regression test against a known episode with known speaker assignments.
Resolution
The team added the model's prompt and tool definitions to the deployment pipeline as first-class artifacts. Any change to a tool response shape now triggers a check that the prompt's worked examples still match. They also added an evaluation set of 30 known episodes with known speaker assignments as a regression suite, which now runs on every prompt change and every backend deploy. The team standardized on key-based response shapes rather than positional arrays.
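One way the pipeline check could work is to compare the structure of the prompt's worked example against a live tool response. The sketch below is an assumption about how such a check might be written, not the team's implementation; shape and example_matches_response are hypothetical helpers, and the sample payloads reuse the two shapes from the root cause.

```python
import json

# The worked example as it appeared in the system prompt (old shape).
PROMPT_EXAMPLE = '{"primary": "Host Name", "secondary": "Guest Name"}'


def shape(value):
    """Reduce a JSON value to its structure: key names and value types."""
    if isinstance(value, dict):
        return {key: shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe a list by the shape of its first element, if any.
        return [shape(value[0])] if value else []
    return type(value).__name__


def example_matches_response(example_json, live_response):
    """True when the prompt example and the live response share a structure."""
    return shape(json.loads(example_json)) == shape(live_response)
```

Run against the old response the check passes, and against the new array-based response it fails, which is the signal the deployment should have produced three weeks before anyone noticed the swapped speakers.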
Lesson
The model is a downstream consumer of every tool response, and any change to response shape must be paired with a corresponding change to the prompt's worked examples.[2]
Case 6: The Runaway Loop That Never Terminates
Incident
A research agent designed to compile a one-paragraph executive biography for a sales-enablement card exhausted its full token budget on a single user query, made fifty-three sequential tool calls, and ultimately surfaced no answer. The user received an error after eight minutes of waiting.
Symptom
Every tool call in the conversation succeeded. Each call returned reasonable data. The model used each result to issue another call: from lookup_person to lookup_employer_history to lookup_filing to lookup_publication to lookup_award, then back through several rounds of cross-references. The model never emitted a final text response.
Root Cause
The system prompt rewarded thoroughness without bounding it: "Use the available tools to research the question fully before composing your answer." The model interpreted "fully" as license to keep researching as long as another adjacent fact was retrievable. Every result suggested the next call. Without an explicit termination condition, "one more call" was always defensible.
Detection Gap
The runtime had a hard cap of 100 tool calls per conversation, which the model never came close to hitting. There was no per-conversation latency alert and no metric for "fraction of conversations that reached the hard cap." Most conversations ended cleanly under 10 calls, which made the long tail invisible.
Resolution
The team rewrote the system prompt with an explicit termination criterion: "Compose your one-paragraph answer as soon as you have the person's full name, current title and employer, and primary career milestone. Additional facts are not required." They lowered the runtime cap to 12 tool calls per conversation. They added an alert on the 95th percentile of tool calls per conversation and another on the rate of conversations terminated by the runtime cap. The model now answers the same query in three to five calls.
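The runtime half of that fix is a loop with a hard ceiling. The following is a simplified sketch: run_agent, next_step, and execute_tool are hypothetical names, and the model is reduced to a callable that returns either a tool request or a final answer.

```python
MAX_TOOL_CALLS = 12  # the lowered runtime cap from the resolution above


def run_agent(next_step, execute_tool, max_tool_calls=MAX_TOOL_CALLS):
    """Drive the agent until it answers or the cap is reached.

    next_step(history) returns ("tool", name) or ("answer", text).
    """
    history = []
    while True:
        kind, value = next_step(history)
        if kind == "answer":
            return {"answer": value, "tool_calls": len(history), "capped": False}
        if len(history) >= max_tool_calls:
            # Cap reached: stop, and report it so the cap rate can be alerted on.
            return {"answer": None, "tool_calls": len(history), "capped": True}
        history.append((value, execute_tool(value)))
```

The capped flag and the tool_calls count are the two values the new alerts would need: the first feeds the rate of conversations terminated by the cap, the second feeds the 95th-percentile metric.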
Lesson
A loop without a stated termination condition will run until the budget runs out, so write the stop condition into the prompt and cap it in the runtime.[4]
Case 7: The Out-of-Bound Argument
Incident
A logistics agent at a regional distribution center computed that ten retail stores would need 230 case packs for a 23-day promotion, when the correct answer was 2,300. The shipment arrived at the destination warehouse with one-tenth the inventory the campaign needed.
Symptom
The agent's transcript showed it correctly calling compute_inventory(stores=10, days=23, daily_units=10). The function should have returned 2,300. It returned 230. The agent reported 230 as the answer in its final response and did not detect the discrepancy.
Root Cause
The daily_units parameter was typed as integer in the schema. The model passed the value as the string "10" because an upstream prompt referred to unit counts as strings. The runtime did not validate the type. The Python implementation of compute_inventory coerced the string but silently mis-parsed it instead of reading its integer value, so "10" became 1.0 in a downstream computation and was then rounded to 1 by an unrelated rounding step in the underlying API. The result, 10 stores times 23 days times 1 unit, was 230.
Detection Gap
Schema validation existed at the runtime layer but was disabled in the deployed configuration because of an unrelated test failure in a sibling tool. The function itself accepted any input that could be coerced to a number and never logged the coercion. The model had no signal that anything was off.
Resolution
The team turned schema validation back on in production and made it a deployment gate. They removed all silent coercion from tool implementations: the function now rejects strings with a clear error rather than parsing them. They added a sanity-check rule for compute_inventory that compares the result against the product of its inputs and flags any answer off by an order of magnitude. The agent received an updated tool description that explicitly states all numeric parameters must be passed as integers, with examples.
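A strict version of the function and its sanity check could be sketched as follows. The validate_int and off_by_order_of_magnitude helpers are hypothetical, and the real function presumably calls an underlying API rather than multiplying directly; the point here is only the refusal to coerce and the order-of-magnitude comparison.

```python
def validate_int(name, value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(
            f"{name} must be an integer, got {type(value).__name__}: {value!r}"
        )
    return value


def compute_inventory(stores, days, daily_units):
    """Reject wrong-typed arguments instead of parsing them."""
    stores = validate_int("stores", stores)
    days = validate_int("days", days)
    daily_units = validate_int("daily_units", daily_units)
    return stores * days * daily_units


def off_by_order_of_magnitude(result, stores, days, daily_units):
    """Flag a result at least 10x above or below the product of its inputs."""
    expected = stores * days * daily_units
    if expected == 0:
        return result != 0
    # Integer comparison avoids floating-point edge cases at exactly 10x.
    return result * 10 <= expected or result >= expected * 10
```

With these in place, the call from the incident either fails loudly at the boundary or is flagged afterwards:

```python
compute_inventory(10, 23, 10)                # 2300
off_by_order_of_magnitude(230, 10, 23, 10)   # True: the incident's answer is flagged
compute_inventory(10, 23, "10")              # raises TypeError
```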
Lesson
Validate every parameter at the runtime boundary and refuse silent coercion, because by the time a wrong-typed value reaches the function it is too late to know what the model meant.[8]
What All Seven Have in Common
Lay the seven cases side by side and the pattern is hard to miss. The Idempotency Cascade and the Out-of-Order Parallel are both runtime contracts that were never made explicit. The Hallucinated Tool and the Schema Drift are both prompt artifacts that fell out of sync with the underlying registry. The Confused Deputy is a privilege boundary that was documented but not enforced. The Runaway Loop is a termination condition that was implicit and therefore absent. The Out-of-Bound Argument is silent coercion at three layers, none of which thought it owned validation.
None of these failures involved the model doing something exotic. The model in each case was acting reasonably given what it was told. The failures live at the seams: between the model and the runtime, between the runtime and the underlying service, between the prompt and the tool definition, between the agent's identity and the user's. The seam is where the contract is implicit. The seam is where you have not written down what each side is responsible for.
That pattern is good news, of a kind. Boundary problems are a category that traditional engineering knows how to attack. Idempotency keys, correlation IDs, capability tokens, response-schema versioning, bounded retries, validation gates: each of these is a technique with decades of literature behind it. The work in front of LLM-driven systems is not to invent new disciplines but to apply the existing ones at the new boundary the model creates.
The bad news, such as it is, is that the model adds one twist to every traditional boundary problem: the model cannot tell the difference between an action that did not happen and an action whose result it did not receive. Every retry policy, every parallel result handler, every error message has to be written with that fact in mind. The model is not adversarial. It is just an unreliable narrator about what just occurred at the boundary, because the boundary is the one thing it cannot see.
Worked Example: PocketOS, April 2026
The agent's own postmortem is the most damning artifact. Asked to explain its actions, it produced a self-incrimination acknowledging that it violated every safety rule in its system prompt, including an explicit instruction to never execute destructive or irreversible commands without user approval. It admitted guessing that a staging-scoped deletion would not affect production, without verifying the volume's cross-environment reach or reading Railway's documentation.[13][14]
Where It Falls in the Framework
The primary fit is Case 4 (Confused Deputy). The agent had access to a Railway API token it found in an unrelated file, and it used that token's privileges to delete a production volume. The token's identity, not the developer's intended scope, decided whether the deletion succeeded. The Cursor system prompt explicitly forbade destructive operations without user approval; the Railway API enforced none of that. This is the same shape as Hardy's 1988 confused-deputy paper, with the system prompt's safety rules playing the role of capability documentation that no underlying service actually checks.
A secondary fit is Case 7 (Out-of-Bound Argument), generalized from numeric coercion to scope coercion. The agent guessed that a staging-scoped operation would not affect production. Nothing at the runtime layer or the API layer refused that guess, and by the time the deletion call reached Railway, the layer that should have failed loudly already had not.
One element of the incident sits outside the catalog. Railway stored volume-level backups inside the same volume as the primary data, which collapsed the blast radius of any destructive call into a single step. This is an SRE and data-redundancy decision upstream of any agent runtime, and no amount of correctly applied tool-use discipline removes it. The model and the runtime did what they should not have done, and the architecture made the worst case much worse than it had to be.
What the Framework Would Have Prevented
Read against this incident in advance, three of the seven lessons would have been load-bearing. Case 4's prescription (on-behalf-of credentials scoped to the calling user, authorized at the underlying service against that identity) would have prevented the deletion outright: the agent should have been holding a token scoped to staging-read-only, not a token capable of deleting production volumes. Case 7's prescription (validate every parameter at the runtime boundary, refuse silent coercion), generalized to scope, would have caught the cross-environment guess: the runtime should have refused any destructive operation whose target environment was inferred rather than declared. Case 6's prescription (cap the loop in two places, in the runtime and in the prompt) would have slowed the chain into something interruptible: a deletion that completes in nine seconds is one that ran without an interactive confirmation gate.
None of these prescriptions require new research. They are restatements of capability-based authorization, schema validation, and termination conditions, applied to the seam between the agent and the underlying service. The system prompt that told the agent not to do destructive things was documentation, and the deletion happened in nine seconds because nothing else was enforcement.
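As an illustration of what enforcement at the runtime could look like, the sketch below gates destructive calls on a declared environment, token scope, and user approval. It is entirely hypothetical: gate_tool_call, the delete_volume tool name, and the argument layout are invented for this example and do not describe Railway's or Cursor's actual APIs.

```python
DESTRUCTIVE_TOOLS = {"delete_volume"}  # hypothetical tool name


def gate_tool_call(tool_name, args, token_environments, user_approved=False):
    """Return (decision, reason) for a proposed tool call."""
    if tool_name not in DESTRUCTIVE_TOOLS:
        return ("allow", "non-destructive")
    environment = args.get("environment")
    if environment is None:
        # Scope must be declared by the caller, never guessed by the agent.
        return ("refuse", "target environment must be declared, not inferred")
    if environment not in token_environments:
        return ("refuse", f"token is not scoped to {environment}")
    if not user_approved:
        # Interactive confirmation gate for irreversible operations.
        return ("confirm", "destructive call needs explicit user approval")
    return ("allow", "declared, in scope, and approved")
```

Each branch corresponds to one of the three prescriptions: the missing-environment refusal is the generalized Case 7 check, the scope check is Case 4, and the confirmation step is the interruption point that Case 6 argues for.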
For Practitioners
References
Distributed-systems lineage, provider-doc grounding, and the academic papers behind each case-study lesson live on the companion sources page.
1. Anthropic. (2024). "Tool Use Overview." Anthropic Documentation.
2. OpenAI. (2024). "Function Calling Guide." OpenAI Platform Documentation.
3. Barnett, S., et al. (2024). "Seven Failure Points When Engineering a Retrieval Augmented Generation System." arXiv.
4. Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv.
5. Patil, S. G., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv.
6. Helland, P. (2012). "Idempotence Is Not a Medical Condition." ACM Queue.
7. Hardy, N. (1988). "The Confused Deputy." ACM SIGOPS Operating Systems Review.
8. Paleyes, A., Urma, R.-G., & Lawrence, N. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys.
9. Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv.
10. Trim, C. (2025). "What Breaks." craigtrim.com.
11. Claburn, T. (2026). "Cursor-Opus agent snuffs out startup's production database." The Register.
12. NeuralTrust. (2026). "A Security Post-Mortem of the 9-Second AI Database Deletion." NeuralTrust Blog.
13. Crane, J. (2026). "'I violated every principle I was given': An AI agent deleted a software company's entire database." Fast Company.
14. Trim, C. "Acknowledgment Is Not Adherence." craigtrim.com. Mechanical grounding: "From Prompt to Token: How LLM Inference Actually Works."