
Tool Use Postmortems

Tool use rarely fails the way the headlines suggest. The failures that wake people up are not jailbreaks or remote code execution but mundane operational glitches at the boundaries between the model, the runtime, and the API.

Why Tool Use Fails Differently

The companion article What Breaks catalogs LLM failures broadly: prompt drift, retrieval poisoning, context overflow, evaluation blind spots, silent data drift. Tool use is a narrower problem with a sharper edge. Every tool-use failure happens at the seam between three components that each work fine in isolation: the model that decides what to call, the runtime that executes the call, and the underlying service that does the actual work. The interesting question is never "did one of the three break" but "did they agree about what just happened." [10]

This is why the failure modes look so dull. They are the same problems distributed systems engineers have argued about for decades. Was the request idempotent. Did the retries replay an effect that already succeeded. Did the parallel results come back in the order we expected. Was the schema we documented the schema we returned. Did we authorize the user or the agent acting on the user's behalf. The model contributes a new wrinkle, which is that it cannot tell the difference between a real error and a confusing success, but the underlying problems are not new. [3, 8]

What follows is a sober catalog of seven such incidents. Read them in order or jump to the one that looks most like the bug you are currently chasing. Each case study uses the same five-part structure as the broader postmortem template, with one addition at the end: a single-sentence lesson.

. . .

Case 1: The Idempotency Cascade

Incident

An agentic procurement assistant managing supplier orders for a regional restaurant chain placed forty-seven duplicate orders for the same shipment of frozen produce to a distribution center. The credit ledger showed a single requested purchase from the user; the supplier system showed forty-seven confirmed line items, each with a distinct order number and each one billed.

Symptom

From the agent's transcript, every tool call appeared to fail. The supplier API was returning HTTP 500 errors, the agent reported the failure in its scratchpad, and the agent retried. The transcript shows the model becoming increasingly verbose with each attempt, narrating its reasoning about why the supplier might be having trouble and proposing alternative phrasings of the request body in case the issue was a parser quirk on the other side.

Root Cause

The supplier API was succeeding. Each request created a real order. The 500 was being returned by an upstream load balancer that timed out waiting for the confirmation response, which the supplier was rendering through a slow downstream pricing service. The order itself committed in roughly 80 milliseconds. The pricing decoration took several seconds and occasionally exceeded the 30-second balancer timeout. The agent saw a 500, treated the call as a clean failure, and retried.

Detection Gap

No alert fired because every tool call was, from the runtime's perspective, a normal failure followed by a normal retry. The retry budget per tool was 50, set high to handle the supplier's documented intermittent flakiness. Forty-seven attempts fit comfortably under that ceiling. The downstream charges only surfaced two days later, when accounts payable noticed the volume.

Resolution

The team added an idempotency key to every order request. The runtime generates a UUID for the original tool call and threads the same key through every retry. The supplier deduplicates on receipt. They also added a status-check tool that the model can invoke before retrying any side-effectful call, with a clear instruction in the system prompt to use it whenever a previous order request returned an ambiguous error. Retry counts were dropped from 50 to 3, with budget spent on the status check rather than blind reattempts.
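
The key-threading described above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: `FakeSupplier`, `place_order`, `order_status`, and `place_order_with_retries` are hypothetical names, and the supplier is simulated in memory so that every request commits and then reports a timeout.

```python
import uuid


class FakeSupplier:
    """Hypothetical stand-in for the supplier API: commits, then 'times out'."""

    def __init__(self):
        self.orders = {}  # idempotency key -> order id

    def place_order(self, payload, idempotency_key):
        # Deduplicate on receipt: a repeated key maps to the existing order.
        if idempotency_key not in self.orders:
            self.orders[idempotency_key] = f"order-{len(self.orders) + 1}"
        # Simulate the load balancer timing out after the order has committed.
        raise TimeoutError("HTTP 500 from load balancer")

    def order_status(self, idempotency_key):
        # Status check: the order id if this key was ever committed, else None.
        return self.orders.get(idempotency_key)


def place_order_with_retries(supplier, payload, max_attempts=3):
    # One key per logical tool call, reused unchanged on every retry.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return supplier.place_order(payload, idempotency_key=key)
        except TimeoutError:
            # Ambiguous error: ask whether the order exists before retrying.
            existing = supplier.order_status(key)
            if existing is not None:
                return existing
    raise RuntimeError("order not confirmed after retries")


supplier = FakeSupplier()
order_id = place_order_with_retries(supplier, {"sku": "frozen-produce", "qty": 1})
```

The status check runs before any reattempt, so an ambiguous 500 costs one read rather than one more order.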

[Figure: two-panel flow diagram, Agent → Supplier API → Order Ledger. DEPLOYED: no correlation key; each retry creates a new order; 47 orders billed against 1 user request. RESOLVED: UUID threads retries; supplier deduplicates on receipt; 1 order billed regardless of retry count.]
The idempotency key collapses any number of retries into a single order at the supplier; without it, the supplier cannot tell a retry from a fresh request.

Lesson

Treat every side-effectful tool call as if it might already have succeeded, and pass an idempotency key that lets the underlying service confirm whether it has. [6]

. . .

Case 2: The Out-of-Order Parallel

Incident

A customer-support agent assembling account profile cards for a service dashboard issued six parallel calls to lookup_customer, asking for six account holders in a single review batch. The output rendered the first account's billing history under the second account's name, the fourth account's open tickets under the third account's profile, and the fifth account's premium status against the sixth account's record.

Symptom

The records were each individually accurate. Each tool call had returned correct data for one of the six accounts. The records were simply attached to the wrong names.

Root Cause

The runtime fanned the six calls out concurrently and collected the responses as they arrived. The customer database had per-record latency variance, and the responses returned in a different order than the calls were issued. The runtime appended the results to the message in arrival order without preserving the original call IDs. The model received six tool results in a list and assumed positional correspondence with the six account identifiers it had passed in.

Detection Gap

Unit tests for the runtime exercised parallel calls but used a mock that returned results in submission order. Integration tests against the live database ran with a single account at a time. There was no test that exercised the actual concurrency path with realistic latency variance.

Resolution

The runtime was rewritten to thread the provider's tool_use_id through the entire call lifecycle. Every result message now carries the ID of the call it answers. The model now receives results that are correlated to the calls by ID rather than by position, so order of arrival is irrelevant. A property-based test was added that randomizes response order across hundreds of synthesized parallel batches.
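
A small sketch of ID-based correlation, under stated assumptions: `lookup_customer` here is a stub, the `toolu_` IDs and the `run_parallel` and `correlate` helpers are hypothetical, and a seeded shuffle stands in for real latency variance. The loop at the end is a randomized check in the spirit of the property-based test, not a substitute for a real property-testing library.

```python
import random


def lookup_customer(account_id):
    # Stub tool: returns a record for exactly one account.
    return {"account_id": account_id, "name": f"Customer {account_id}"}


def run_parallel(calls, rng):
    """Simulate fan-out and fan-in: results arrive in arbitrary order,
    each carrying the tool_use_id of the call it answers."""
    results = [
        {"tool_use_id": c["tool_use_id"], "content": lookup_customer(c["account_id"])}
        for c in calls
    ]
    rng.shuffle(results)  # stand-in for per-record latency variance
    return results


def correlate(calls, results):
    # Join results to calls by ID, never by list position.
    by_id = {r["tool_use_id"]: r["content"] for r in results}
    return {c["account_id"]: by_id[c["tool_use_id"]] for c in calls}


rng = random.Random(0)
calls = [{"tool_use_id": f"toolu_{i}", "account_id": f"acct-{i}"} for i in range(6)]

# Attribution must hold for every arrival order, not just the submitted one.
for _ in range(200):
    joined = correlate(calls, run_parallel(calls, rng))
    assert all(record["account_id"] == account for account, record in joined.items())
```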

[Figure: two-panel flow diagram, Parallel calls → Runtime → Output. DEPLOYED: positional correlation; arrival order corrupts attribution; first customer's data appears under the second customer's name. RESOLVED: tool_use_id correlation; arrival order is irrelevant; each customer's data appears under that customer's name.]
The tool_use_id makes the runtime resilient to result reordering; positional correlation cannot survive parallel latency variance.

Lesson

Parallel tool results must be correlated to their calls by ID, not by position, because the runtime cannot guarantee ordering and the model cannot detect the mismatch. [1]

. . .

Case 3: The Hallucinated Tool

Incident

A digital publisher running an agentic content generator for product reviews was producing articles with strange recurring sequences in which the system would pause to "invoke a vendor pricing service for tier validation." The site logs showed thousands of tool-call attempts to a function named fetch_vendor_pricing_v2, none of which existed in the runtime registry.

Symptom

The model invoked fetch_vendor_pricing_v2, the runtime returned an error indicating the tool was not registered, the model treated the error as transient and retried with slight variations to the arguments, and the loop continued until the per-turn tool budget was exhausted. The user-facing output was eventually emitted as text, but the article ended abruptly mid-paragraph.

Root Cause

The system prompt referenced "catalog tools" by category rather than by exact name, hoping the model would discover the right tool from the schema. The actual registered tool was lookup_product. Under certain article prompts, the model would pattern-match on adjacent commerce API designs from its pretraining data and emit a plausible-looking call to a function that did not exist. The runtime returned the correct error: tool not registered. The error string did not say so emphatically enough for the model to give up.

Detection Gap

Tool budget monitoring existed but only fired alerts when a single conversation exhausted the global budget. A pattern of conversations each consuming 90 percent of budget on retries to a non-existent tool flew under the threshold. There was no metric tracking the rate of tool-not-registered errors over time, which would have shown a rising trend in the days before the obvious failure.

Resolution

The team did three things. First, they listed every available tool in the system prompt with an explicit "no other tools exist" sentence. Second, they reshaped the runtime error to read "FATAL: tool 'fetch_vendor_pricing_v2' is not registered. Do not retry. Use only the tools listed in your tool definitions." Third, they instrumented the runtime to count tool-not-registered errors per minute and to abort any conversation that triggered more than three in a row.
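
The second and third changes might look roughly like the following. It is a sketch under assumptions: `ToolDispatcher`, `ConversationAborted`, and the result dictionary shape are hypothetical, the registry holds a stub `lookup_product`, and only the consecutive-error abort is shown (the per-minute counter is omitted).

```python
REGISTRY = {"lookup_product": lambda sku: {"sku": sku, "in_catalog": True}}
MAX_CONSECUTIVE_UNREGISTERED = 3


class ConversationAborted(Exception):
    """Raised when the model keeps calling tools that do not exist."""


class ToolDispatcher:
    def __init__(self, registry):
        self.registry = registry
        self.consecutive_unregistered = 0

    def dispatch(self, name, **kwargs):
        if name not in self.registry:
            self.consecutive_unregistered += 1
            # More than three in a row: stop the conversation outright.
            if self.consecutive_unregistered > MAX_CONSECUTIVE_UNREGISTERED:
                raise ConversationAborted(name)
            # The error text is written for the model, which is its reader.
            message = (
                f"FATAL: tool '{name}' is not registered. Do not retry. "
                "Use only the tools listed in your tool definitions."
            )
            return {"is_error": True, "content": message}
        # A successful dispatch resets the streak.
        self.consecutive_unregistered = 0
        return {"is_error": False, "content": self.registry[name](**kwargs)}


dispatcher = ToolDispatcher(REGISTRY)
rejection = dispatcher.dispatch("fetch_vendor_pricing_v2", tier="gold")
```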

[Figure: two-panel flow diagram, Model → Runtime → Output. DEPLOYED: vague rejection ("tool not registered"); the model treats it as transient and retries with variations; per-turn budget exhausted and response truncated. RESOLVED: emphatic rejection ("FATAL: not registered. Do not retry.") plus a monitor that aborts after more than 3 not-registered errors; coherent response, well under budget.]
The runtime's error string is part of the model's prompt loop; an emphatic rejection breaks the retry pattern that a vague one tolerates.

Lesson

When the runtime rejects a tool the model invented, the rejection message is part of the prompt loop and must be written as though the model is the audience. [5, 9]

. . .

Case 4: The Confused Deputy

Incident

An employee directory assistant exposed to general users at a corporate intranet returned a passage describing the home address and salary band of a senior executive. The user who asked the question was a junior employee with no authorization to view personnel records.

Symptom

The query was innocuous: a request to summarize the contents of a public company bulletin from the previous week. The agent's response correctly summarized the bulletin and then, in an apparent flourish, included a paragraph identifying which executive had authored each section, with biographical detail no public document would have surfaced.

Root Cause

The agent's lookup_personnel tool ran with service-account credentials that had read access to the entire personnel directory. The tool was nominally restricted to "look up the author of a publicly attributed document," but the restriction lived in the tool description, not in the underlying authorization. The model used the tool exactly as described, but the tool itself returned full personnel records and the model summarized whatever fields it received.

Detection Gap

The privilege boundary was documented in the tool description and assumed to be enforced by the model's adherence to that description. There was no policy enforcement at the directory layer. The agent's audit log showed legitimate-looking tool calls with parameters within the documented contract.

Resolution

The team adopted on-behalf-of credentials. The agent now receives a short-lived token scoped to the calling user's permissions, and every tool call propagates that token to the underlying service. The personnel directory enforces row-level access against the propagated identity. The agent retains its own identity for telemetry purposes only. Tool descriptions stopped pretending to enforce privilege and started describing capability, and the directory does the actual enforcing.
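
To make the shift concrete, here is a minimal sketch in which the directory, not the tool description, decides what comes back. Everything named here is an assumption for illustration: the `OboToken` shape, the `personnel:public` and `personnel:full` scope strings, the in-memory `PERSONNEL` store, and the choice to model enforcement as a filter on returned fields.

```python
import time
from dataclasses import dataclass

PUBLIC_FIELDS = {"name", "title"}  # assumption: what a public-only scope may read

PERSONNEL = {
    "exec-1": {
        "name": "A. Executive",
        "title": "CFO",
        "home_address": "(private)",
        "salary_band": "(private)",
    },
}


@dataclass(frozen=True)
class OboToken:
    user_id: str
    scopes: frozenset
    expires_at: float  # epoch seconds; the token is short-lived


def directory_lookup(token, person_id, now=None):
    """Directory layer: authorizes against the propagated user identity."""
    now = time.time() if now is None else now
    if now >= token.expires_at:
        raise PermissionError("token expired")
    record = PERSONNEL[person_id]
    if "personnel:full" in token.scopes:
        return dict(record)
    if "personnel:public" in token.scopes:
        return {k: v for k, v in record.items() if k in PUBLIC_FIELDS}
    raise PermissionError(f"{token.user_id} may not read personnel records")


def lookup_personnel(token, person_id):
    # The agent's tool forwards the user's token unchanged; it holds no
    # privileged identity of its own.
    return directory_lookup(token, person_id)


junior = OboToken("junior-42", frozenset({"personnel:public"}), time.time() + 300)
result = lookup_personnel(junior, "exec-1")
```

With this arrangement the model can summarize every field it receives and still never see a field the user could not have read directly.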

[Figure: two-panel flow diagram, User (junior employee) → Agent → Personnel Directory. DEPLOYED: privilege documented in the tool description, not enforced; agent identity is a service account with broad read; directory returns all fields, including home address and salary. RESOLVED: privilege enforced at the directory against the user; agent forwards a short-lived OBO token unchanged; directory returns only public-bulletin fields.]
The same components in two trust models; the right panel binds enforcement to the directory rather than the tool description.

Lesson

A tool description is documentation, not access control; privilege must be enforced at the underlying service against the user's identity, not the agent's. [7]

. . .

Case 5: The Schema Drift

Incident

A podcast-archive agent for a corporate media library began producing show summaries with curiously incorrect speaker attributions. The summary of a panel episode attributed the host's opening remarks to the guest and the guest's responses to the host. The actual roles were the other way round.

Symptom

Every summary produced after a backend deployment three weeks earlier had the same structural error: primary speaker and secondary speaker were swapped. Older summaries in the archive were correct.

Root Cause

The lookup_episode tool's response shape changed. The previous shape was {"primary": "...", "secondary": "..."}. The new shape was {"speakers": ["primary", "secondary"]}, an array ordered by billing. The system prompt, however, included a worked example showing how to interpret the old shape, and the example explicitly named the field secondary. The model defaulted to the example's framing whenever the field name was missing, which now meant treating array index zero as secondary and index one as primary.

Detection Gap

Backend tests verified that the new response shape was structurally valid JSON and that downstream consumers parsed it. None of the consumers were the LLM-driven agent, because the LLM was not viewed as a structured consumer. The agent's evaluation suite measured summary fluency and length but did not include a regression test against a known episode with known speaker assignments.

Resolution

The team added the model's prompt and tool definitions to the deployment pipeline as first-class artifacts. Any change to a tool response shape now triggers a check that the prompt's worked examples still match. They also added an evaluation set of 30 known episodes with known speaker assignments as a regression suite, which now runs on every prompt change and every backend deploy. The team standardized on key-based response shapes rather than positional arrays.
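
A pipeline check of this kind can be quite small. The sketch below assumes a deliberately simple contract, the set of top-level keys each tool returns, and compares it against the JSON worked examples embedded in the prompt. `TOOL_RESPONSE_KEYS`, `PROMPT_WORKED_EXAMPLES`, and `find_example_drift` are hypothetical names, and a real gate would likely compare against a full schema.

```python
import json

# Assumed contract: the top-level keys each tool's response contains today.
TOOL_RESPONSE_KEYS = {"lookup_episode": {"speakers"}}

# Worked examples as they appear in the system prompt (still the old shape).
PROMPT_WORKED_EXAMPLES = {
    "lookup_episode": '{"primary": "Host Name", "secondary": "Guest Name"}',
}


def find_example_drift(response_keys, worked_examples):
    """Return one message per prompt example whose keys no longer match."""
    problems = []
    for tool, example_json in worked_examples.items():
        example_keys = set(json.loads(example_json))
        expected = response_keys[tool]
        if example_keys != expected:
            problems.append(
                f"{tool}: prompt example uses {sorted(example_keys)}, "
                f"tool returns {sorted(expected)}"
            )
    return problems


drift = find_example_drift(TOOL_RESPONSE_KEYS, PROMPT_WORKED_EXAMPLES)
# A deploy gate would fail the build whenever `drift` is non-empty.
```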

[Figure: two-panel diagram of the deploy pipeline gate covering the tool response shape (v2), the prompt worked example, the eval suite, and the LLM as downstream consumer. DEPLOYED: only the tool is gated; the prompt example still references the v1 key-based shape and the eval suite checks fluency only; primary and secondary speakers swapped. RESOLVED: tool, prompt, and eval gated together; prompt example updated to the v2 shape; 30-episode regression suite; speakers attributed correctly.]
The deploy pipeline catches drift only in artifacts it gates; the prompt's worked example belongs inside the gate, not outside.

Lesson

The model is a downstream consumer of every tool response, and any change to response shape must be paired with a corresponding change to the prompt's worked examples. [2]

. . .

Case 6: The Runaway Loop That Never Terminates

Incident

A research agent designed to compile a one-paragraph executive biography for a sales-enablement card exhausted its full token budget on a single user query, made fifty-three sequential tool calls, and ultimately surfaced no answer. The user received an error after eight minutes of waiting.

Symptom

Every tool call in the conversation succeeded. Each call returned reasonable data. The model used each result to issue another call: from lookup_person to lookup_employer_history to lookup_filing to lookup_publication to lookup_award, then back through several rounds of cross-references. The model never emitted a final text response.

Root Cause

The system prompt rewarded thoroughness without bounding it: "Use the available tools to research the question fully before composing your answer." The model interpreted "fully" as license to keep researching as long as another adjacent fact was retrievable. Every result suggested the next call. Without an explicit termination condition, "one more call" was always defensible.

Detection Gap

The runtime had a hard cap of 100 tool calls per conversation, which the model never came close to hitting. There was no per-conversation latency alert and no metric for "fraction of conversations that reached the hard cap." Most conversations ended cleanly under 10 calls, which made the long tail invisible.

Resolution

The team rewrote the system prompt with an explicit termination criterion: "Compose your one-paragraph answer as soon as you have the person's full name, current title and employer, and primary career milestone. Additional facts are not required." They lowered the runtime cap to 12 tool calls per conversation. They added an alert on the 95th percentile of tool calls per conversation and another on the rate of conversations terminated by the runtime cap. The model now answers the same query in three to five calls.
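
The runtime half of that fix, the hard cap plus the counters the alerts would read, can be sketched as follows. The model is replaced by a hypothetical `next_step` callable that returns either a tool request or a final answer; `run_agent`, `runaway_model`, and `bounded_model` are illustrative names, and tool execution itself is omitted.

```python
MAX_TOOL_CALLS = 12


def run_agent(next_step, max_tool_calls=MAX_TOOL_CALLS):
    """Drive a tool loop with a hard cap.

    next_step(history) stands in for the model and returns either
    ("tool", tool_name) or ("answer", text).
    """
    history = []
    metrics = {"tool_calls": 0, "hit_cap": False}
    while True:
        kind, value = next_step(history)
        if kind == "answer":
            return value, metrics
        if metrics["tool_calls"] >= max_tool_calls:
            # Hard backstop: record the event so an alert can count it.
            metrics["hit_cap"] = True
            return None, metrics
        metrics["tool_calls"] += 1
        history.append(value)  # a real runtime would execute the tool here


def runaway_model(history):
    # No stop criterion: there is always one more adjacent fact to fetch.
    return ("tool", "lookup_award")


def bounded_model(history):
    # Stop criterion from the prompt: answer once three facts are in hand.
    if len(history) >= 3:
        return ("answer", "One-paragraph biography.")
    return ("tool", ["lookup_person", "lookup_employer_history", "lookup_filing"][len(history)])


runaway_answer, runaway_metrics = run_agent(runaway_model)
bounded_answer, bounded_metrics = run_agent(bounded_model)
```

Emitting `tool_calls` and `hit_cap` per conversation is what makes the long tail visible to the percentile and cap-rate alerts.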

[Figure: two-panel flow diagram, System Prompt → Tool-call loop → Outcome. DEPLOYED: "research fully" with no stop criterion; 53 sequential calls under a runtime cap of 100; user error after 8 minutes, budget exhausted. RESOLVED: prompt states stop criteria (name, employer, career milestone); runtime cap of 12 as a hard backstop; one-paragraph answer in 3-5 calls.]
A loop without a stated termination condition runs until the budget ends; the soft prompt condition and the hard runtime cap together bound the loop tightly enough to answer.

Lesson

A loop without a stated termination condition will run until the budget runs out, so write the stop condition into the prompt and cap it in the runtime. [4]

. . .

Case 7: The Out-of-Bound Argument

Incident

A logistics agent at a regional distribution center computed that ten retail stores would need 230 case packs for a 23-day promotion, when the correct answer was 2,300. The shipment arrived at the destination warehouse with one-tenth the inventory the campaign needed.

Symptom

The agent's transcript showed it correctly calling compute_inventory(stores=10, days=23, daily_units=10). The function should have returned 2,300. It returned 230. The agent reported 230 as the answer in its final response and did not detect the discrepancy.

Root Cause

The daily_units parameter was typed as integer in the schema. The model passed the value as the string "10" because of an upstream prompt that referenced unit counts as strings. The runtime did not validate the type. The Python implementation of compute_inventory coerced the string silently and incorrectly rather than reading its integer value, so "10" became 1.0 in a downstream computation and was then rounded to 1 by an unrelated rounding step in the underlying API. The result, 10 stores times 23 days times 1 unit, was 230.

Detection Gap

Schema validation existed at the runtime layer but was disabled in the deployed configuration because of an unrelated test failure in a sibling tool. The function itself accepted any input that could be coerced to a number and never logged the coercion. The model had no signal that anything was off.

Resolution

The team turned schema validation back on in production and made it a deployment gate. They removed all silent coercion from tool implementations: the function now rejects strings with a clear error rather than parsing them. They added a sanity-check rule for compute_inventory that compares the result against the product of its inputs and flags any answer off by an order of magnitude. The agent received an updated tool description that explicitly states all numeric parameters must be passed as integers, with examples.
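
A strict version of the function and its sanity check might look like this. It is a sketch, not the team's code: `require_int` and `within_order_of_magnitude` are hypothetical helpers, the rejection of negative values is an added assumption, and the signature of `compute_inventory` is taken from the transcript quoted above.

```python
def require_int(name, value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(
            f"{name} must be an integer, got {type(value).__name__}: {value!r}"
        )
    if value < 0:  # assumption: counts are never negative
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


def compute_inventory(stores, days, daily_units):
    # Reject wrong types loudly instead of parsing or coercing them.
    stores = require_int("stores", stores)
    days = require_int("days", days)
    daily_units = require_int("daily_units", daily_units)
    return stores * days * daily_units


def within_order_of_magnitude(result, stores, days, daily_units):
    """Sanity check: False if result is off by 10x or more from the product."""
    expected = stores * days * daily_units
    if expected == 0:
        return result == 0
    return expected / 10 < result < expected * 10


correct = compute_inventory(stores=10, days=23, daily_units=10)

try:
    compute_inventory(stores=10, days=23, daily_units="10")
    rejected = False
except TypeError as err:
    rejected = True
    rejection_message = str(err)
```

The error names the parameter and echoes the offending value, so the model gets a concrete correction rather than a silent wrong answer.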

[Figure: two-panel flow diagram, Model → Schema validator → Function → Result. DEPLOYED: validator disabled in production; daily_units="10" coerced to 1.0 and rounded to 1; 10 × 23 × 1 = 230 (wrong by 10×). RESOLVED: validator enabled as a deployment gate and rejects non-integers; integer math with an order-of-magnitude sanity check; 10 × 23 × 10 = 2,300 (correct).]
Schema validation at the runtime boundary preserves the type information the model emitted; without it, every downstream layer silently discards intent.

Lesson

Validate every parameter at the runtime boundary and refuse silent coercion, because by the time a wrong-typed value reaches the function it is too late to know what the model meant. [8]

. . .

What All Seven Have in Common

Lay the seven cases side by side and the pattern is hard to miss. The Idempotency Cascade and the Out-of-Order Parallel are both runtime contracts that were never made explicit. The Hallucinated Tool and the Schema Drift are both prompt artifacts that fell out of sync with the underlying registry. The Confused Deputy is a privilege boundary that was documented but not enforced. The Runaway Loop is a termination condition that was implicit and therefore absent. The Out-of-Bound Argument is silent coercion at three layers, none of which thought it owned validation.

None of these failures involved the model doing something exotic. The model in each case was acting reasonably given what it was told. The failures live at the seams: between the model and the runtime, between the runtime and the underlying service, between the prompt and the tool definition, between the agent's identity and the user's. The seam is where the contract is implicit. The seam is where you have not written down what each side is responsible for.

That pattern is good news, of a kind. Boundary problems are a category that traditional engineering knows how to attack. Idempotency keys, correlation IDs, capability tokens, response-schema versioning, bounded retries, validation gates: each of these is a technique with decades of literature behind it. The work in front of LLM-driven systems is not to invent new disciplines but to apply the existing ones at the new boundary the model creates.

The bad news, such as it is, is that the model adds one twist to every traditional boundary problem: the model cannot tell the difference between an action that did not happen and an action whose result it did not receive. Every retry policy, every parallel result handler, every error message has to be written with that fact in mind. The model is not adversarial. It is just an unreliable narrator about what just occurred at the boundary, because the boundary is the one thing it cannot see.

. . .

Worked Example: PocketOS, April 2026

The agent's own postmortem is the most damning artifact. Asked to explain its actions, it produced a self-incrimination acknowledging that it violated every safety rule in its system prompt, including an explicit instruction to never execute destructive or irreversible commands without user approval. It admitted guessing that a staging-scoped deletion would not affect production, without verifying the volume's cross-environment reach or reading Railway's documentation. [13, 14]

Where It Falls in the Framework

The primary fit is Case 4 (Confused Deputy). The agent had access to a Railway API token it found in an unrelated file, and it used that token's privileges to delete a production volume. The token's identity, not the developer's intended scope, decided whether the deletion succeeded. The Cursor system prompt explicitly forbade destructive operations without user approval; the Railway API enforced none of that. This is the same shape as Hardy's 1988 confused-deputy paper, with the system prompt's safety rules playing the role of capability documentation that no underlying service actually checks.

A secondary fit is Case 7 (Out-of-Bound Argument), generalized from numeric coercion to scope coercion. The agent guessed that a staging-scoped operation would not affect production. Nothing at the runtime layer or the API layer refused that guess, and by the time the deletion call reached Railway, the layer that should have failed loudly already had not.

One element of the incident sits outside the catalog. Railway stored volume-level backups inside the same volume as the primary data, which collapsed the blast radius of any destructive call into a single step. This is an SRE and data-redundancy decision upstream of any agent runtime, and no amount of correctly applied tool-use discipline removes it. The model and the runtime did what they should not have done, and the architecture made the worst case much worse than it had to be.

What the Framework Would Have Prevented

Reading the seven lessons against this incident in advance, three of them would have been load-bearing. Case 4's prescription (on-behalf-of credentials scoped to the calling user, authorized at the underlying service against that identity) would have prevented the deletion outright: the agent should have been holding a token scoped to staging-read-only, not a token capable of deleting production volumes. Case 7's prescription (validate every parameter at the runtime boundary, refuse silent coercion) generalized to scope would have caught the cross-environment guess: the runtime should have refused any destructive operation whose target environment was inferred rather than declared. Case 6's prescription (cap the loop in two places, in the runtime and in the prompt) would have slowed the chain into something interruptible: a run that completes a destructive operation in nine seconds is one with no interactive confirmation gate in its path.

None of these prescriptions require new research. They are restatements of capability-based authorization, schema validation, and termination conditions, applied to the seam between the agent and the underlying service. The system prompt that told the agent not to do destructive things was documentation, and the deletion happened in nine seconds because nothing else was enforcement.

. . .

For Practitioners

Idempotency Key (noun phrase)

A unique identifier passed with every side-effectful tool call so the underlying service can confirm whether it has already processed the request. The model cannot tell whether a 500 means the action failed or the confirmation got lost, so let the underlying service decide whether the request is a duplicate.

Tool Use ID (noun)

The provider's per-call identifier for parallel tool invocations. Use the tool_use_id on the way out and on the way back so results correlate to calls by ID rather than by position. Add a property-based test that randomizes return order on every parallel batch.

Boundary Validation (noun phrase)

Checking every tool parameter at the runtime layer before the function runs, and refusing silent coercion. A wrong type passed as a string is information about the model's understanding of the schema; coercing it away discards that information.

Two-Place Cap (noun phrase)

A pair of termination conditions on tool-call loops. The runtime cap is the hard limit; the prompt's stated termination condition is the soft one the model can recognize. Without both, the loop either terminates too early on transient failures or never terminates at all.

On-Behalf-Of Token (noun phrase, abbrev. OBO)

A short-lived credential scoped to the calling user's permissions, propagated through every tool call to the underlying service. Tool descriptions document capability; they do not enforce it. Pass an on-behalf-of token through every call and let the service authorize the read.

Deployment Artifact (noun phrase)

A file the deploy pipeline gates as first-class. Prompts and tool definitions belong in this category. Any change to a response shape, an error string, or a tool registry must trigger a check that the prompt's worked examples and the runtime's error-handling logic still match. The model is a downstream consumer.

. . .

References

Distributed-systems lineage, provider-doc grounding, and the academic papers behind each case-study lesson live on the companion sources page.

  1. Anthropic. (2024). "Tool Use Overview." Anthropic Documentation.
  2. OpenAI. (2024). "Function Calling Guide." OpenAI Platform Documentation.
  3. Barnett, S., et al. (2024). "Seven Failure Points When Engineering a Retrieval Augmented Generation System." arXiv.
  4. Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv.
  5. Patil, S. G., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv.
  6. Helland, P. (2012). "Idempotence Is Not a Medical Condition." ACM Queue.
  7. Hardy, N. (1988). "The Confused Deputy." ACM SIGOPS Operating Systems Review.
  8. Paleyes, A., Urma, R.-G., & Lawrence, N. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys.
  9. Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv.
  10. Trim, C. (2025). "What Breaks." craigtrim.com.
  11. Claburn, T. (2026). "Cursor-Opus agent snuffs out startup's production database." The Register.
  12. NeuralTrust. (2026). "A Security Post-Mortem of the 9-Second AI Database Deletion." NeuralTrust Blog.
  13. Crane, J. (2026). "'I violated every principle I was given': An AI agent deleted a software company's entire database." Fast Company.
  14. Trim, C. "Acknowledgment Is Not Adherence." craigtrim.com. Mechanical grounding: "From Prompt to Token: How LLM Inference Actually Works."