Sources
Grounding, citations, and further reading for From Prompts to Actions.
All of this is optional. These are the sources used to write the article, listed here for grounding and so anyone who wants to go deeper on a specific point knows where to look.
The article itself is self-contained. Nothing on this page is required reading.
About the Sources
Provider documentation: OpenAI and Anthropic
The two primary commercial provider documents that define the function-calling contract in 2024-2025. Useful as the framing the article argues against, and as the canonical reference for the wire-format details.
JSON Schema Specification
The format that everyone reaches for when they describe a tool. Worth reading once end-to-end to understand what the schema is doing and, more importantly, what it is not doing.
Foundational research papers (2023)
The 2023 papers that established tool use as a learned token-emission pattern rather than an executive capability. Each takes a different angle: Toolformer on training, Gorilla on schema and description quality, ToolLLM on scaling tool counts, ReAct on interleaving reasoning with action.
Practitioner writing
Three practitioner sources that ground the article's stance in real production experience. Anthropic's "Building Effective Agents" argues for keeping orchestration simple. Greshake et al. is the canonical paper on indirect prompt injection. Huyen's "Building LLM applications for production" makes the case that production reliability comes from the code wrapping the model, not the model itself.
The Question Behind the Feature
1The vendor framing of function calling ↩ Back to article
OpenAI's function calling guide is the canonical vendor-side framing the article positions itself against. It describes the model as "calling" functions and the developer as "providing" tools, language that smooths over the actual mechanics of what is happening. Useful as a foil: read it first to see the framing, then read the article to see the alternative framing of the same wire protocol.
OpenAI Platform Documentation. Read the guide
What Actually Happens
2The four-move loop in vendor terms ↩ Back to article
Anthropic's tool-use documentation describes the same request-response loop the article calls the "four-move cycle," with one important emphasis: the doc is explicit that "the runtime executes the function" and the model "receives the result as a new message." This honesty about which actor performs which step is exactly the framing the article wants the reader to internalize. Read alongside the OpenAI guide above for a side-by-side view of how two vendors describe the same protocol.
Anthropic Documentation. Read the guide
What the Mainstream Framing Gets Wrong
3What JSON Schema actually specifies ↩ Back to article
The JSON Schema specification defines a vocabulary for describing the structure and constraints of JSON documents. Reading it once demystifies the relationship between the schema and the model: the schema is a description, validated by code that chooses to validate. Nothing in the spec mentions language models. The "schema as the model's API" framing is a layer applied on top by tooling vendors, not a property of the format itself.
json-schema.org. Read the spec
4Indirect prompt injection and the dispatcher boundary ↩ Back to article
Greshake et al. demonstrate that LLM-integrated applications are vulnerable to indirect prompt injection: instructions embedded in content the model reads via a tool, rather than in the user's input. The paper's catalog of attack patterns (web pages, emails, calendar entries) makes the case that the security boundary cannot live inside the model. Every layer of validation has to live in the dispatcher: the code that decides which functions run with which arguments. The article's section on misframing security review draws directly from this framing.
Greshake et al., 2023. Read on arXiv
The Pattern That Makes the Loop Work
5Schema and description quality predict tool-call reliability ↩ Back to article
Patil et al. fine-tune a model to use thousands of real APIs and report empirical results on the relationship between schema quality and tool-call accuracy. Their key finding: how a tool is described matters as much as which tools are available. Vague names and underspecified parameters degrade performance measurably. Concrete, verb-noun function names with tightly described parameters produce reliable tool selection. The article's "schema is documentation for the model" stance rests on this finding.
Patil et al., 2023. Read Gorilla on arXiv
9Reasoning and acting as interleaved layers ↩ Back to article
Yao et al. introduce the ReAct pattern: the model produces interleaved reasoning steps and action requests, with the action layer external to the model. This is the academic predecessor to the four-move cycle. ReAct's key contribution is showing that separating "thinking" tokens from "tool call" tokens, and executing the tool calls in code, produces more reliable behavior than letting the model pretend to act in prose. The article's design pattern of "model decides what, schema describes how, code controls whether and when" is essentially the ReAct decomposition with the boundaries named.
Yao et al., 2023. Read ReAct on arXiv
A Short Inventory of Tool Examples
6Tool use as a learned token-emission pattern ↩ Back to article
Schick et al. show that tool use is a behavior the model learns by self-supervised training on annotated tool-call traces. The model is not given an executive capability; it is trained to emit tokens that, when interpreted by surrounding code, look like tool calls. This is the cleanest academic statement of the article's claim that the model never actually calls anything. Toolformer's contribution is the training methodology; the conceptual finding (that "tool use" is just a learned output format) is what the article rests on.
Schick et al., 2023. Read Toolformer on arXiv
Why This Matters for the Rest of the Course
7Tool-call traces as training data ↩ Back to article
Qin et al. extend the function-calling pattern to thousands of APIs and show that tool-call logs become labeled training data once the system is structured around emit-then-execute. Every successful trace is a positive example; every failed trace, paired with the corrected dispatch, is a fine-tuning signal. The article's claim that "every move-two emission is a labeled training example waiting to be collected" is the practical takeaway from this paper: a function-calling architecture is also a fine-tuning data pipeline.
Qin et al., 2023. Read ToolLLM on arXiv
10Production reliability lives in the code, not the model ↩ Back to article
Chip Huyen's essay on building LLM applications for production makes the practitioner case that the engineering disciplines of schema validation, structured logging, retries, rate limits, and audit trails do not change in the LLM era. They simply move one layer outward, treating the model's output as untrusted input from a network peer. The article's claim that "the code is the layer where reality lives" is a tighter restatement of Huyen's argument.
Huyen, C., 2024. Read on huyenchip.com
For Practitioners
8Keeping the orchestration layer simple ↩ Back to article
Anthropic's "Building Effective Agents" essay argues for keeping orchestration logic simple and putting the heavy lifting in tools and code rather than in clever prompting of the model. The piece distinguishes between "workflows" (deterministic graphs of steps that occasionally call a model) and "agents" (the model in a loop with tools), and recommends starting with the former. The article's "For Practitioners" framing inherits this stance: own the dispatcher, log every move-two output, and treat refusals as something code does, not the model.
Anthropic Research, 2024. Read on anthropic.com