Sources
Grounding, citations, and further reading for Tool Loops: Multi-Step and Parallel Calls.
All of this is optional. These are the sources used to write the article, listed here for grounding and so anyone who wants to go deeper on a specific point knows where to look.
The article itself is self-contained. Nothing on this page is required reading.
About the Sources
Provider documentation: OpenAI and Anthropic
The OpenAI and Anthropic guides are the canonical wire-format references. They describe the request shape, the response shape, the parallel-call structure, and the JSON-mode contract that every modern function-calling implementation follows. Useful as a baseline for what the protocol guarantees and where implementations diverge.
Foundational research papers (2022-2024)
The arXiv papers that shaped the modern function-calling pattern. ReAct established interleaved reasoning-and-action. Toolformer trained a model to emit tool calls as a learned token-emission pattern. Gorilla and ToolLLM scaled tool use to thousands of APIs. Wang et al. survey the resulting agentic-systems literature.
Distributed-systems writing
Three sources that pre-date the LLM era but apply directly to tool loops. Martin Fowler on consumer-driven contracts (the model is a consumer, the tool service is a producer). The AWS Builders' Library on timeouts, retries, and exponential backoff with jitter. Barnett et al.'s catalog of seven failure points, originally framed for RAG, that maps cleanly onto tool-call pipelines.
From One Call to Many
1The single-turn function-calling primitive ↩ Back to article
OpenAI's function-calling guide is the canonical vendor description of the single round trip the article opens with: the model emits a tool call, the runtime executes it, the result returns, the model produces text. The guide presents this as a complete capability rather than the building block it actually is. Useful as a starting point precisely because it makes the limitations of the single-turn shape obvious by omission.
OpenAI Platform Documentation. Read the guide
The Loop, Step by Step
2The four-move cycle in vendor terms ↩ Back to article
Anthropic's tool-use documentation describes the four-move cycle the article walks through. Move one: send the conversation plus tool definitions. Move two: model emits a tool_use block. Move three: runtime executes. Move four: runtime appends the tool_result and re-enters. The doc is explicit that each move belongs to a different actor, which is the point the article opens with.
Anthropic Documentation. Read the guide
3The probabilistic stopping rule ↩ Back to article
Yao et al.'s ReAct paper introduces the pattern of interleaved reasoning and action with the action layer external to the model. The article's claim that "the presence or absence of a tool call is the entire stopping rule" is the practical consequence of ReAct's design: the model cannot signal completion through any channel other than not asking for another action. ReAct also documents the failure mode where the model continues to reason after the task is functionally complete, which is the source of many runaway loops.
Yao et al., 2022. Read on arXiv
Sequential vs Parallel
4Multiple tool calls in one assistant message ↩ Back to article
Anthropic's parallel-tool-calls documentation specifies the wire format for emitting multiple tool calls in a single assistant message and returning multiple tool_result blocks in a single user message. The doc is explicit that the runtime is responsible for collecting all results before the next inference and for preserving the model's tool-use IDs unchanged through the round trip. This is the source for the article's discussion of ID linkage as a quiet failure mode.
Anthropic Claude API Documentation. Read the guide
When Calls Fail
5Seven failure points, applied to tool calls ↩ Back to article
Barnett et al. catalog seven failure points when engineering a RAG system. The taxonomy maps cleanly onto tool-call pipelines: missing content, schema drift, invalid arguments, downstream service failure, unexpected response shape, network timeout, and retry-induced duplicates. The article's "five distinct ways a tool call can fail" is essentially the Barnett taxonomy collapsed and rephrased for the tool-loop case.
Barnett et al., 2024. Read on arXiv
6Structured outputs and JSON-mode validation ↩ Back to article
OpenAI's structured-outputs guide explains the constrained-decoding mechanism that makes JSON validity nearly free in modern providers. The article's observation that "modern providers enforce JSON validity at the decoder level, so syntactically broken arguments will almost always parse" is the practical takeaway. The remaining failure modes are semantic, not syntactic.
OpenAI Platform Documentation. Read the guide
7Tool calls as consumer-driven contracts ↩ Back to article
Martin Fowler's 2006 essay on consumer-driven contracts is the pre-LLM antecedent to the article's framing. The model is a consumer; the tool service is a producer. When the producer changes its response shape and the consumer's expectations do not move with it, you get a successful call that returns the wrong shape. Fowler's prescription, that the consumer publishes its expectations and the producer respects them, is exactly the contract a robust tool-loop runtime should enforce on top of vendor JSON validation.
Fowler, 2006. Read on martinfowler.com
8Timeouts, retries, and idempotency ↩ Back to article
The AWS Builders' Library essay on timeouts, retries, and exponential backoff with jitter is the canonical engineering reference for the timeout-vs-failure problem the article describes. A timeout is not a failure; it is the absence of a response. Retrying a non-idempotent endpoint can cause the same destructive operation to run twice. Every recommendation in the article's failure-mode section that involves retry behavior is grounded in this essay.
Amazon Builders' Library. Read on amazon.com
When to Terminate
9The agentic-systems literature ↩ Back to article
Wang et al.'s 2023 survey on LLM-based autonomous agents catalogs the architectures, planning strategies, and termination conditions that the field has explored. The survey is useful for situating the article's stance: "even sophisticated agentic loops are still while-loops with a probabilistic termination condition." Most published agent designs add structure on top of that loop (planners, verifiers, memory) but none replace the underlying probabilistic stopping rule.
Wang et al., 2023. Read on arXiv
Further Reading
10Tool use as a learned token-emission pattern
Schick et al. show that tool use is a behavior the model learns by self-supervised training on annotated tool-call traces. Toolformer's contribution is the training methodology; the conceptual finding (that "tool use" is just a learned output format, not an executive capability) is the foundation the article rests on when it says the model emits tokens that the runtime interprets as a function call.
Schick et al., 2023. Read on arXiv
11Schema and description quality predict reliability
Patil et al. fine-tune a model to use thousands of real APIs and report empirical results on the relationship between tool-description quality and tool-call accuracy. Their finding (that vague descriptions degrade performance measurably) is the empirical grounding for the companion article on schema design and is worth reading alongside this one for the wider context on what makes a tool loop reliable.
Patil et al., 2023. Read on arXiv
12Tool-call traces become training data
Qin et al. extend the function-calling pattern to thousands of APIs and show that tool-call logs become labeled training data once the system is structured around emit-then-execute. Every successful trace is a positive example; every failed trace, paired with the corrected dispatch, is a fine-tuning signal. Practical reading for any team thinking about the long tail of tool reliability after a robust loop is in production.
Qin et al., 2023. Read on arXiv