Tool Loops: Multi-Step and Parallel Calls

One tool call rarely finishes the job. Real workflows chain calls, run them in parallel, and recover from intermediate failures. The loop terminates when the model stops asking for tools, and that condition is the simplest definition of an agent we have.

In Brief

The function-calling primitive covered in the previous article handles a single round trip: model emits a tool call, runtime executes it, result returns, model produces text. Production systems almost never look like that. They iterate. The model issues a call, sees the result, decides whether the task is complete, and either responds or calls another tool. The loop terminates when the model produces a text-only message with no further tool calls. That probabilistic stopping condition is what makes the construct an agent rather than a script, and it is also what makes it brittle.

This article walks the four-move cycle in detail, contrasts sequential and parallel call patterns, catalogs five distinct failure modes that any robust runtime must handle, and discusses when to terminate the loop and how to keep it from running away with the budget. The argument is that even sophisticated agentic loops are still while-loops with a probabilistic termination condition. Treating them that way (with retry caps, budgets, idempotency, and explicit error handling) is what separates a working demo from a system you can trust.

From One Call to Many

The single-turn function call covered in "How Function Calling Actually Works" establishes the primitive. The model is given a tool definition, decides whether the user's request needs that tool, and if so, emits a JSON object that names the tool and its arguments. The runtime executes the tool. The result returns. The model produces a final text answer. One round trip, one decision, one result.¹

That shape is enough for a weather lookup or a unit conversion. It is not enough for almost anything else. Consider a slightly larger task: find the entry for "Earth" in the Hitchhiker's Guide to the Galaxy, check whether it has been updated to reflect the planet's recent unscheduled demolition, and if so, return the new entry alongside the old one for comparison. That request needs at least three tool calls. The first looks up the entry by name. The second checks the revision history. The third fetches a specific prior revision by ID. Each call depends on the previous one's result, and the IDs cannot be guessed in advance.

What the runtime is being asked to do is straightforward in shape: keep talking to the model, keep handing it intermediate results, and stop when it stops asking for things. What the runtime is actually doing is more interesting. It is building a transcript that grows on every turn, feeding that transcript back into the model, and trusting a probabilistic process to decide when to halt. The next sections describe what that loop looks like, what can go wrong inside it, and how to keep it from spinning forever.

. . .

The Loop, Step by Step

Every iteration of the tool loop is the same four-move cycle, repeated until the model stops emitting tool calls. The moves are ordered, and responsibility passes between the runtime, the model, and the tool function as the cycle proceeds.²

1. Inference. The runtime sends the conversation history (system prompt, prior turns, tool definitions, and any prior tool results) to the model and waits for a response.
2. Dispatch. If the response contains one or more tool_use blocks, the runtime parses each block, validates its arguments against the schema, and routes each call to the appropriate function.
3. Execution. The function runs. It may succeed, raise an exception, time out, or return something that does not match the documented response shape. Whatever happens, the runtime captures the result.
4. Re-entry. The runtime appends the assistant's tool calls and the corresponding tool_result blocks to the conversation history, then loops back to step 1.

The cycle ends in step 1 when the model returns a message with text content and no tool_use blocks. That text is the final answer. The model has decided, based on the cumulative transcript, that it has enough information to respond. There is no other termination signal. The runtime cannot inspect the model's confidence, cannot ask whether it is finished, cannot tell the difference between "task complete" and "I do not know what to do next, so I will summarize what I have." The presence or absence of a tool call is the entire stopping rule.³

The simplest implementation is a while-loop with a counter. Pseudocode:

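def run_agent(prompt, tools, max_iterations=10):
    """Run the tool loop until the model answers in text or the cap is hit.

    `model` and `execute_tool` are placeholders for the provider client and
    the tool dispatcher; execute_tool must return a tool_result block.
    """
    messages = [{"role": "user", "content": prompt}]

    for step in range(max_iterations):
        # Inference: send the whole transcript, tool definitions included
        response = model.create(messages=messages, tools=tools)
        messages.append({"role": "assistant", "content": response.content})

        # Termination: no tool_use blocks means the text is the final answer
        tool_calls = [b for b in response.content if b.type == "tool_use"]
        if not tool_calls:
            return "".join(b.text for b in response.content if b.type == "text")

        # Dispatch and execution: run every tool call from this turn
        results = [execute_tool(b.name, b.input, b.id) for b in tool_calls]

        # Re-entry: hand all results back in a single user message
        messages.append({"role": "user", "content": results})

    raise RuntimeError(f"Hit max_iterations={max_iterations} without termination")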

That fits on a single screen. It is also the entire structural skeleton of every commercial coding agent, every research assistant, every travel-booking bot. The intelligence is in the model and the tool definitions. The loop itself is mechanical. Recognizing this is clarifying, because it tells you exactly which class of bugs you are signing up for: budget overruns, infinite retries, malformed arguments, partial failures, and stuck states. Software engineering has a long memory of these problems, and most of the techniques that worked for retry policies in distributed systems work here too.

. . .

Sequential vs Parallel

Some tasks are inherently sequential. To find Frodo, you first look up the Fellowship roster to retrieve his current member ID, then validate that the ID is still active (he may have left the party in Lothlórien), and only then query the location service with that ID. Each step depends on the prior step's output, so the model has no choice but to issue one call, wait for the result, and decide what to do next. That is a chain.

Other tasks are inherently parallel. Suppose the user asks for a side-by-side biography of six members of the Fellowship: Frodo, Sam, Aragorn, Legolas, Gimli, and Boromir. There is a single tool, get_character(id), that fetches one biography at a time. The six calls have no data dependency on each other. Modern providers let the model emit all six tool_use blocks in a single assistant message, and the runtime is expected to execute them concurrently, collect the six tool_result blocks, and feed all of them back in a single user message before the next inference.⁴

The two shapes look very different in the transcript.

Two transcript shapes for two different tasks: a dependent chain on the left, an independent batch on the right.

Parallel reduces wall-clock latency, sometimes dramatically. Six 200ms calls in sequence is 1.2 seconds; six in parallel is 200ms plus overhead. The cost is structural complexity. The runtime must execute the calls concurrently (a thread pool, an async event loop, or a futures-based gather), collect every result before the next inference, handle the case where some calls succeed and others fail, and preserve the linkage between each tool_use ID and its corresponding tool_result ID so the model can correlate them.
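As a rough illustration of the concurrent path, here is a minimal sketch using asyncio. The block shapes (tool_use, tool_result, tool_use_id) follow the naming used in this article rather than any provider's exact wire format, and get_character is a stand-in with a simulated 200 ms delay, not a real service.

```python
import asyncio
import time

async def get_character(character_id: str) -> dict:
    """Hypothetical tool: pretend each lookup takes about 200 ms."""
    await asyncio.sleep(0.2)
    return {"id": character_id, "bio": f"Biography of {character_id}"}

async def run_batch(tool_calls: list[dict]) -> list[dict]:
    """Execute every tool_use block concurrently and keep the ID linkage."""
    async def run_one(call: dict) -> dict:
        output = await get_character(call["input"]["id"])
        # Each tool_result carries the ID of the tool_use that produced it.
        return {"type": "tool_result", "tool_use_id": call["id"], "content": output}

    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*(run_one(c) for c in tool_calls))

names = ["frodo", "sam", "aragorn", "legolas", "gimli", "boromir"]
calls = [
    {"type": "tool_use", "id": f"call_{i}", "name": "get_character", "input": {"id": name}}
    for i, name in enumerate(names)
]

start = time.perf_counter()
results = asyncio.run(run_batch(calls))
elapsed = time.perf_counter() - start  # close to one call's latency, not six
```

Because the six simulated lookups overlap, the batch should finish in roughly the time of a single call plus scheduling overhead, and every result can still be matched to its originating call by ID.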

Failed parallel calls compound. One bad result poisons the batch unless every call is independently validated before any of them are surfaced to the model. If the Boromir lookup throws because the underlying service had a hiccup, you do not want the model to receive five biographies and one stack trace and then guess at how to summarize the set. You want the failure isolated, retried if appropriate, and reported in a way the model can reason about. The next section catalogs what those failures actually look like.
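One way to keep a single failure from poisoning the batch is to wrap each call so that it always yields a tool_result, marked as an error when the call fails. The sketch below assumes a hypothetical get_character stub in which the Boromir lookup fails, and an is_error flag on the result block; the exact field names depend on the provider.

```python
from concurrent.futures import ThreadPoolExecutor

def get_character(character_id: str) -> dict:
    """Hypothetical tool; the boromir lookup simulates a service hiccup."""
    if character_id == "boromir":
        raise ConnectionError("upstream service unavailable")
    return {"id": character_id, "bio": f"Biography of {character_id}"}

def run_isolated(call: dict) -> dict:
    """Run one call and turn transport failures into a structured result."""
    try:
        content = get_character(call["input"]["id"])
        return {"type": "tool_result", "tool_use_id": call["id"], "content": content}
    except (ConnectionError, TimeoutError) as exc:
        # Only the failing call is marked; its siblings are unaffected.
        return {
            "type": "tool_result",
            "tool_use_id": call["id"],
            "is_error": True,
            "content": {"error": type(exc).__name__, "message": str(exc)},
        }

def run_batch(tool_calls: list[dict]) -> list[dict]:
    with ThreadPoolExecutor(max_workers=max(1, len(tool_calls))) as pool:
        # map preserves input order, so results line up with the calls.
        return list(pool.map(run_isolated, tool_calls))

batch = [
    {"id": "call_0", "input": {"id": "frodo"}},
    {"id": "call_1", "input": {"id": "boromir"}},
]
results = run_batch(batch)
```

The model then receives one clean biography and one explicit, labeled error instead of a stack trace mixed into the set. A retry policy, if appropriate, would sit inside run_isolated.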

One more nuance: the model decides whether to issue parallel calls. The runtime can advertise the capability, but it cannot force it. If the model does not see the calls as independent, it will issue them sequentially regardless. Tool descriptions that explicitly say "this call is independent and may be issued in parallel with others" tend to nudge the model toward batching, but the decision is probabilistic, not guaranteed.
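A tool definition that advertises independence might look like the following sketch. The name / description / input_schema layout is one common convention; the wording of the description is illustrative, and whether it actually changes the model's batching behavior is something to measure rather than assume.

```python
# Hypothetical definition for the get_character tool discussed above.
get_character_tool = {
    "name": "get_character",
    "description": (
        "Fetch the biography of one Fellowship member by ID. "
        "Each lookup is independent of every other lookup, so several "
        "get_character calls may be issued in parallel in a single turn."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "id": {"type": "string", "description": "Member ID, for example 'frodo'"},
        },
        "required": ["id"],
    },
}
```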

. . .

The Five Failure Modes

A tool call can fail in five distinct ways, and each one needs a different response. Lumping them together (catching Exception at the dispatch layer and feeding the message back to the model) is a common pattern in tutorial code and a common cause of runaway loops in production. The taxonomy that follows separates the cases by where in the cycle the failure occurs and what the runtime can actually do about it.⁵

Failure 1: Invalid JSON

The model emits a tool_use block whose input field cannot be parsed as JSON. A trailing comma, an unescaped quote, a smart quote substituted for a straight one, an unterminated string. The runtime tries to deserialize the arguments and the parser raises before the dispatch step even gets to validation.

This is rarer than it used to be. Providers that offer a strict or structured-output mode enforce JSON validity at the inference layer through constrained decoding, so a response produced in that mode will almost always parse.⁶ When a parse failure does happen, the cause is usually that the model is operating outside structured output mode, or that the runtime is concatenating streamed deltas incorrectly and producing a truncated buffer. The recovery is mechanical: feed the parser error back to the model in the next tool_result, ask it to reissue the call, and cap the retries at three. If the model cannot produce parseable JSON after three tries, the issue is not transient and the loop should terminate with a hard error.
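A minimal sketch of that recovery path, assuming the raw arguments arrive as a string and that the runtime reports errors through a tool_result with an is_error flag (both assumptions about the surrounding runtime, not a provider API):

```python
import json

def parse_tool_input(raw: str, tool_use_id: str):
    """Return (arguments, None) on success or (None, error_result) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # Give the model the parser's own message and position so it can fix the call.
        error_result = {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "is_error": True,
            "content": (
                f"Arguments were not valid JSON: {exc.msg} "
                f"(line {exc.lineno}, column {exc.colno}). Please reissue the call."
            ),
        }
        return None, error_result
```

The caller is expected to count consecutive parse failures per tool call and stop the loop with a hard error once the retry cap is reached.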

Failure 2: Valid JSON That Violates the Schema

The arguments parse cleanly but do not match the declared schema. A required field is missing. A string is supplied where a number is expected. An enum value falls outside the allowed set. The model has invented a plausible-looking argument that the schema explicitly forbids.

Imagine a summon_daleks(quantity, exterminate_target, location) tool whose quantity field is constrained to an integer between 1 and 12 (anything more would overload the time corridor). The model emits {"quantity": "many", "exterminate_target": "the Doctor", "location": "Skaro"}. The JSON is valid. The arguments are not: "many" is neither an integer nor within the allowed range. The runtime catches the violation in the validation layer, before the call ever reaches the tool function. Recovery is the same as Failure 1: feed a structured validation error back to the model with the specific field, the constraint, and the offending value, and let it retry. The companion article on reliable tool schemas goes into how to design schemas that minimize this class of failure in the first place.
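The structured error could be produced along these lines. This is a hand-rolled check kept dependency-free for illustration; a real runtime would more likely use a JSON Schema validator. Treating exterminate_target and location as required strings is an assumption, since only the quantity constraint is specified above.

```python
def validate_summon_daleks(args: dict) -> list[dict]:
    """Return structured violations; an empty list means the arguments are valid."""
    errors = []
    for field in ("quantity", "exterminate_target", "location"):
        if field not in args:
            errors.append({"field": field, "constraint": "required", "value": None})

    if "quantity" in args:
        quantity = args["quantity"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(quantity, int) and not isinstance(quantity, bool)
        if not is_int or not 1 <= quantity <= 12:
            errors.append(
                {"field": "quantity", "constraint": "integer between 1 and 12", "value": quantity}
            )

    for field in ("exterminate_target", "location"):
        if field in args and not isinstance(args[field], str):
            errors.append({"field": field, "constraint": "string", "value": args[field]})
    return errors

bad = {"quantity": "many", "exterminate_target": "the Doctor", "location": "Skaro"}
violations = validate_summon_daleks(bad)
```

Each violation names the field, the constraint, and the offending value, which is the information the model needs to correct the call on the next turn.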

Failure 3: Valid Arguments That the Service Rejects

The arguments parse, the schema is satisfied, the dispatch executes, and the underlying service returns an error. The Daleks were summoned to Skaro with a quantity of 12, which is technically allowed, but the time corridor is currently blocked because of the events of the previous episode and the API responds with {"error": "TIME_CORRIDOR_UNAVAILABLE", "retry_after": 1800}.

This is a business-logic failure, not a syntactic one. It cannot be fixed by tightening the schema, because the schema and the actual operational state of the service are different things. The runtime should surface the error to the model as a structured tool_result, ideally in the same shape the service returned it. The model can then decide whether to retry with different arguments, fall back to a different tool, or report the failure to the user. What the runtime should not do is silently retry the same call. Three identical retries in a single iteration is the agentic equivalent of a stuck thread, and it burns budget without making progress.
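In code, surfacing the rejection without retrying can be as small as the sketch below. The service stub and the is_error flag are assumptions made for illustration; the error payload is the one from the example above.

```python
def summon_daleks_service(quantity: int, exterminate_target: str, location: str) -> dict:
    """Hypothetical service stub that always reports a blocked time corridor."""
    return {"error": "TIME_CORRIDOR_UNAVAILABLE", "retry_after": 1800}

def execute_summon(call: dict) -> dict:
    """Call the service once and pass any error through in the service's own shape."""
    response = summon_daleks_service(**call["input"])
    result = {"type": "tool_result", "tool_use_id": call["id"], "content": response}
    if "error" in response:
        # No retry here: the model decides what to do with the rejection.
        result["is_error"] = True
    return result

call = {
    "id": "call_7",
    "input": {"quantity": 12, "exterminate_target": "the Doctor", "location": "Skaro"},
}
result = execute_summon(call)
```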

Failure 4: Valid Execution That Returns an Unexpected Shape

This is the most insidious of the five. The call succeeds. The service returns a 200. The result deserializes. But the structure of the result has drifted in some way that the model is not prepared for. Last week the flesh_wound_severity tool returned {"severity": "tis but a scratch", "loss": ["arm"]}. This week, after a quiet schema migration, it returns {"severity_code": 1, "anatomy": {"removed": ["arm"]}}. The Black Knight insists nothing has changed. The runtime cannot tell the difference. The model receives the new shape, attempts to extract severity, finds it missing, and produces a coherent-sounding answer that is wrong.⁷

The recovery for this one cannot live entirely in the loop. The runtime should validate tool results against an expected shape, the same way it validates tool arguments, and reject responses that do not conform. The model can be told the shape was unexpected, but it does not know what the new shape ought to be. The real fix is contract testing between the agent and the tools, with the contract checked on every deploy. Treating tool results as untrusted input has a security dimension as well, but the immediate concern is correctness: a result that does not match its declared shape is a bug at the boundary, and silent acceptance is how that bug reaches the user.
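A minimal sketch of result validation for the flesh_wound_severity example, assuming the expected shape is exactly the two keys shown in last week's response. The checker is hand-rolled for illustration and only inspects top-level keys and types.

```python
EXPECTED_SHAPE = {"severity": str, "loss": list}  # assumed contract for this tool

def validate_flesh_wound_result(result: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the result conforms."""
    problems = []
    for key, expected_type in EXPECTED_SHAPE.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not isinstance(result[key], expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, got {type(result[key]).__name__}"
            )
    unexpected = sorted(set(result) - set(EXPECTED_SHAPE))
    if unexpected:
        problems.append(f"unexpected keys: {', '.join(unexpected)}")
    return problems

last_week = {"severity": "tis but a scratch", "loss": ["arm"]}
this_week = {"severity_code": 1, "anatomy": {"removed": ["arm"]}}
drift = validate_flesh_wound_result(this_week)
```

A non-empty list is the signal to reject the result, log it, and return a structured error instead of letting the model improvise on the drifted shape.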

Failure 5: Network Timeout

The runtime sends the request. The service does not respond within the timeout. The runtime cancels and surfaces a timeout error. From the model's perspective, the call did not complete. From the service's perspective, the call may have completed, may have partially completed, or may not have started.

This is the classic distributed systems problem: a timeout is not a failure signal, it is an absence of a success signal. A 42-second timeout on the Hitchhiker's Guide entry lookup tool tells you nothing about whether the entry was actually retrieved. If the operation has side effects (booking a hotel, sending a message, scheduling a regeneration), naive retry is dangerous, because it duplicates the action. Tool cascade failures, where a single user request produces multiple bookings or charges, are almost always the result of timeout-driven retries on a non-idempotent endpoint.⁸

Recovery here is structural, not conversational. Tools that have side effects should accept idempotency keys, and the runtime should generate a stable key per logical call so that a retry returns the cached result of the first attempt rather than executing the action twice. Tools that are read-only can be retried with exponential backoff. Tools that have ambiguous semantics should be split into a start call and a check status call, so the agent can determine whether the previous attempt succeeded before issuing another one. The model is not going to figure out idempotency on its own. That work belongs to the runtime and the tool layer.
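The idempotency-key idea can be sketched as follows. The key derivation (request ID plus tool name plus canonicalized arguments) is one possible definition of a "logical call" and is an assumption here; a runtime that must allow two genuinely separate identical calls in one request would also need to mix in something like the tool call's own ID. BookingService is a hypothetical in-memory stand-in for a side-effecting tool.

```python
import hashlib
import json

def idempotency_key(request_id: str, tool_name: str, arguments: dict) -> str:
    """Derive a stable key for one logical call within one user request."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    payload = f"{request_id}|{tool_name}|{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class BookingService:
    """Hypothetical side-effecting tool that honors idempotency keys."""

    def __init__(self):
        self._completed = {}
        self.bookings_made = 0

    def book_hotel(self, key: str, guest: str, nights: int) -> dict:
        if key in self._completed:
            # Replay: return the first attempt's result instead of booking again.
            return self._completed[key]
        self.bookings_made += 1
        result = {"confirmation": f"CONF-{self.bookings_made}", "guest": guest, "nights": nights}
        self._completed[key] = result
        return result

service = BookingService()
args = {"guest": "Arthur Dent", "nights": 2}
key = idempotency_key("req-1", "book_hotel", args)
first = service.book_hotel(key, **args)
retry = service.book_hotel(key, **args)  # e.g. issued after a timeout
```

The retry returns the original confirmation and the booking counter stays at one, which is the behavior a timeout-driven retry needs.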

. . .

When to Terminate

The natural stopping condition is the model emitting a text-only response. The unnatural ones are the cases where the model keeps asking for tools and never decides it is done. Three cases dominate.

[Illustration: a small robot in a workshop whose walls are densely covered with wrenches, drills, and other tools, with a faint question mark above its head. Caption: "One more should do it."]

The first is the genuine multi-step task that simply needs more iterations than your default budget allows. A research-style query that walks a graph of related entities can easily issue twenty or thirty tool calls before producing a final summary. If the runtime caps iterations at five, the loop terminates with the model halfway through its plan, and the summary it produces is a partial answer presented as a complete one. The fix is to set max_iterations based on the task profile, not a one-size default, and to log when the cap is hit so you can distinguish "done" from "ran out of room."

The second is the looping bug, where the model issues the same call (or a near-identical call) repeatedly because the result it is getting back does not move the conversation forward. Often this is a Failure 3 or Failure 4 in disguise: the service is returning an error or an unexpected shape, the model is interpreting the result as "the call did not work, try again," and the runtime is dutifully obliging. A simple defense is to detect call repetition: if the same tool name and the same arguments appear twice in a row, log it, and on the third occurrence terminate with a structured error rather than continuing to spend tokens.
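A repetition guard of that kind might be sketched like this. The class name and the default limit of three are illustrative choices; arguments are compared by their canonical JSON form so that key order does not matter.

```python
import json

class RepetitionGuard:
    """Track consecutive identical tool calls (same name, same arguments)."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self._last = None
        self._streak = 0

    def observe(self, name: str, arguments: dict) -> int:
        """Record a call and return the length of the current streak."""
        signature = (name, json.dumps(arguments, sort_keys=True))
        if signature == self._last:
            self._streak += 1
        else:
            self._last, self._streak = signature, 1
        return self._streak

    def should_terminate(self) -> bool:
        """True once the same call has appeared `limit` times in a row."""
        return self._streak >= self.limit
```

The loop would call observe for every tool_use block, log when the streak reaches two, and stop with a structured error when should_terminate returns True.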

The third is the budget overrun, where the loop is making progress but the cumulative token cost or wall-clock cost exceeds what the system is willing to pay for a single user request. Every loop should have at least three caps: a maximum iteration count, a maximum cumulative token spend, and a maximum wall-clock time. Hit any of them and the loop terminates with a partial result and an explicit failure mode in the response. The Total Perspective Vortex of an unbounded agent loop is that it can spend an arbitrary amount of money producing nothing useful, and the only thing standing between you and that outcome is the budget you remembered to set before the request began.⁹
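The three caps can be tracked by one small object, as in the sketch below. The class name and default limits are placeholders, not recommendations, and token counts are assumed to come from whatever usage data the provider returns per response.

```python
import time
from dataclasses import dataclass, field

@dataclass
class LoopBudget:
    max_iterations: int = 10      # assumed default
    max_tokens: int = 50_000      # assumed default
    max_seconds: float = 60.0     # assumed default
    iterations: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def record(self, tokens_used: int) -> None:
        """Account for one completed inference round."""
        self.iterations += 1
        self.tokens += tokens_used

    def exceeded(self):
        """Return the name of the first cap that has been hit, or None."""
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        if self.tokens >= self.max_tokens:
            return "max_tokens"
        if time.monotonic() - self.started >= self.max_seconds:
            return "max_seconds"
        return None
```

Checking exceeded() at the top of each iteration lets the loop stop with a partial result and name the cap that ended it, which keeps "done" distinguishable from "ran out of room" in the logs.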

None of this is exotic. It is the same retry policy work, the same circuit breaker work, the same budget enforcement work that distributed systems engineers have been doing for decades. The only thing that is new is that the inner agent is probabilistic, so the policies cannot rely on the inner system being well-behaved. They have to assume it is not.

. . .

The Honest Assessment

It is fashionable to call these systems "agentic," and the word is doing a lot of work. The construct is genuinely useful and qualitatively different from a stateless prompt-response API. It is also, structurally, a while-loop with a probabilistic termination condition, a list of tool definitions, and a transcript that grows monotonically with each turn. There is no planning layer separate from the model. There is no goal representation the runtime can inspect. The agent's "decision" to keep going or to stop is a single token-prediction event in the stream, indistinguishable from any other.

That is not a complaint. The simplicity is a feature. A while-loop with five well-handled failure modes is something a team can reason about, test, monitor, and debug. A black-box "reasoning agent" with a private internal planner and an opaque termination heuristic is something a team can mostly hope works. The construct described in this article is closer to the first thing than the second, and that is the right place to be when the system has to operate in production. The failures it produces are diagnosable. The budgets it enforces are explicit. The recovery paths it follows are auditable.

What it cannot do is plan in any deep sense, hold a goal across many minutes of work, or recognize when its current strategy is failing in a way that requires backing out and starting over. Those are research problems. The pragmatic path is to keep the loop small, the failure modes covered, and the human in the recovery path for anything where being wrong is expensive. Don't Panic, but do bring a towel: the loop will produce surprises, and the surprises are easier to handle when you have built explicit handling for the cases described above.

. . .

For Practitioners


Budgets

Treat the loop as a while-loop with budgets. Cap iteration count, cumulative token spend, and wall-clock time. Hit any cap and terminate with a structured failure rather than continuing to spend.

Failure taxonomy

Distinguish the five failure modes in code. Invalid JSON, schema violation, service rejection, unexpected result shape, and network timeout each need a different recovery path. A single except Exception block is the agentic equivalent of On Error Resume Next.

Idempotency

Make tools that have side effects idempotent. Generate a stable idempotency key per logical call. A timeout-driven retry should return the cached result of the first attempt, not execute the action twice. This is the single most important defense against tool cascade failures.

Result validation

Validate tool results, not just tool arguments. A result that does not match its declared shape is a bug at the boundary. Reject it loudly, log it, and surface a structured error. Silent acceptance is how schema drift reaches the user.

Repetition guard

Detect call repetition. If the model issues the same tool with the same arguments three times in a row, terminate the loop. The model is stuck, and the runtime is the only thing that can notice.

Logging

Log every call and every result. When a loop misbehaves in production, the only way to understand what happened is to replay the transcript. Structured logs of every tool_use and tool_result pair, with timing and status, pay for themselves the first time you have to debug a failure.
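A structured log entry per call could be produced by a wrapper like the following sketch. The logger name and record fields are illustrative; the wrapper logs and then re-raises, so it does not interfere with whatever failure handling sits above it.

```python
import json
import logging
import time

logger = logging.getLogger("tool_loop")  # hypothetical logger name

def logged_call(tool_use_id: str, name: str, arguments: dict, fn):
    """Run fn(**arguments) and emit one JSON log line with timing and status."""
    record = {"tool_use_id": tool_use_id, "tool": name, "arguments": arguments}
    start = time.perf_counter()
    try:
        record["result"] = fn(**arguments)
        record["status"] = "ok"
        return record["result"]
    except Exception as exc:
        record["status"] = "error"
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise  # logging must not swallow the failure
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        # default=str keeps the log line serializable for unusual result types.
        logger.info(json.dumps(record, default=str))
```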

. . .

References

1. OpenAI. "Function calling." OpenAI Platform Documentation.
2. Anthropic. "Tool use with Claude." Anthropic Documentation.
3. Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629.
4. Anthropic. "Parallel tool calls." Claude API documentation.
5. Barnett, S., et al. (2024). "Seven Failure Points When Engineering a Retrieval Augmented Generation System." arXiv:2401.05856.
6. OpenAI. "Structured Outputs." OpenAI Platform Documentation.
7. Robinson, I. (2006). "Consumer-Driven Contracts: A Service Evolution Pattern." martinfowler.com.
8. Amazon Web Services. "Timeouts, retries, and backoff with jitter." Amazon Builders' Library.
9. Wang, L., et al. (2023). "A Survey on Large Language Model based Autonomous Agents." arXiv:2308.11432.
10. Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761.
11. Patil, S., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334.
12. Qin, Y., et al. (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789.
