← All Articles

From Prompts to Actions

Function calling is usually described as the moment a model gained the ability to use tools. Mechanically, it is the opposite: the model produces a small structured object, and the surrounding code does every part of the work that involves the world.

In Brief

Through Week 3, every prompt produced text. The model read your instructions and emitted prose, JSON, or whatever format the prompt specified. Your application read that text and decided what to do with it. The model never touched the outside world. Function calling does not change that fact. It changes the shape of the text the model is encouraged to produce: a structured call that names a function and supplies its arguments. Your runtime executes the call, feeds the result back, and the model continues. The model has not gained the ability to act, but the system around it has.

This article strips the vendor terminology and frames function calling as a four-move cycle in which only one move belongs to the model. The other three are ordinary application code. The article also names three mistakes the mainstream framing hides: the model is not "calling" anything, the schema is not "the model's API," and the model has no special relationship with side effects. Each one creates a class of bug that is hard to debug because the mental model misled the engineer in the first place.

. . .

The Question Behind the Feature

Vendors describe function calling with verbs that flatter the model. The model "uses tools." The model "calls the API." The model "queries the database." Read enough of this copy and you start to picture a small autonomous thing reaching out into a runtime, fingers on a keyboard, doing work. The picture is wrong. The model never reached anywhere. It produced a string of tokens that, when parsed, happens to describe a function name and a set of arguments. Your code took it from there.

The question worth asking before any of the syntax is: what part of this loop is the model actually doing, and what part is code I wrote? The answer changes how you design the system. If you believe the model is calling the function, you treat its output the way you treat a method invocation: trustworthy, atomic, complete. If you believe the model is producing a request that your runtime will execute, you treat its output the way you treat any incoming network payload from an untrusted client: parse it, validate it, log it, and decide for yourself whether to honor it.¹

Through Week 3, every prompt produced text. The model read your instructions and emitted prose, JSON, code, or whatever output format the prompt specified. Your application read that text and decided what to do with it. The model never touched the outside world. Function calling does not change that fact. It changes the shape of the text the model is encouraged to produce. When you pass a list of tool definitions alongside the prompt, the model gains a second output mode: it can emit a structured call that names a function and supplies its arguments. Your runtime executes the call, feeds the result back, and the model continues. The model has not gained the ability to act, but the system around it has.

. . .

What Actually Happens

Stripped of vendor terminology, the function-calling loop has four moves, and only one of them belongs to the model.

Move one: your application sends a request that contains the user's message and a list of tool definitions written in JSON Schema. The tool definitions are not magic. They are documentation. Each one says, in effect, here is a function name, here is a description of when the function is appropriate, and here are the parameters the function expects with their types and constraints.

Move two: the model reads the user message and the tool definitions together and decides whether the right next output is prose or a structured call. If a call, it emits the function name and a JSON object of arguments. The emission is the same kind of token-by-token autoregressive generation that produces any other output. The string just happens to parse as a tool invocation rather than a paragraph.

Move three: your runtime extracts the function name and arguments, validates them against the schema, and executes the corresponding function. This is ordinary application code. The model is not present in this step. If the function reads a database, opens a socket, writes a file, charges a credit card, or fires a missile, your code is doing it, in a process you control, with credentials you provisioned.

Move four: your application sends the result back to the model along with the prior conversation. The model reads the result and produces the next output, which may be prose for the user, another tool call, or a final answer. The cycle repeats until the model decides it is done.

Of those four moves, only one belongs to the model. The other three are code.

Move 2 is the only step the model owns; the surrounding three are ordinary application code.

To make this less abstract, picture a Hitchhiker's Guide entry lookup. The user types "What does the Guide say about Vogon poetry?" Move one: your code sends the request along with a tool definition for guide_lookup(topic: string). Move two: the model returns a structured call, guide_lookup({"topic": "Vogon poetry"}). Move three: your code receives that JSON, validates the topic against your schema, queries an actual Guide index, and gets back the warning that Vogon poetry is the third worst in the universe. Move four: your code sends that warning back to the model, which composes the user-facing reply. The model never read the index. It described the read it wanted, and your code performed it.²

. . .

What the Mainstream Framing Gets Wrong

The phrase "function calling lets LLMs use tools" hides three different mistakes inside one friendly sentence. Each one shapes how systems get built, and each one creates a class of bug that is hard to debug because the mental model misled the engineer in the first place.

The model is not "calling" anything

When a developer writes db.query("SELECT * FROM ..."), the verb call means something specific: control transfers to the database driver, the query runs, the result returns, control comes back. In function calling, the model has no analog of any of those steps. It produces a token sequence. If your runtime never parses it, never validates it, never executes the function, nothing happens. The function call is a description of an intent, not a transfer of control. Compare it to a ticket on a kanban board rather than a method invocation. The card on the board is real; the work it represents is not done until somebody pulls it.

This matters because it determines who is responsible for the outcome. A classifier that returns the wrong label did not "make a mistake calling the classifier." It emitted a structured call with arguments your code passed to a function your code wrote. Every subsequent fault belongs to a layer your team owns.

The schema is not "the model's API"

Vendors talk about JSON schemas as if they were the model's interface to your system. They are not. They are content the model is trained to attend to, written in a format that happens to also be machine-readable. The model is not bound by the schema in any compile-time sense. It can emit fields you did not declare. It can omit required fields. It can put strings where you specified integers. The schema is closer to a stylebook than a contract.³

Think of a content-moderation tool with a parameter severity: integer between 1 and 10. Nothing about the schema prevents the model from emitting "severity": "CRITICAL". The schema told the model what good looks like; the model produces tokens that approximate good most of the time. The runtime is what enforces the contract, by validating the value before passing it to the underlying scorer. If your validation layer trusts the schema to have already been honored, you have built a system whose correctness depends on a probabilistic process behaving deterministically. It will, until it does not.

The model has no special relationship with side effects

The most damaging version of the misframing shows up in security review. A reviewer sees that the model can "call delete_account()" and treats this as a model permission that needs to be revoked. The framing is incoherent. The model has no permissions. It has tokens. The function delete_account() runs because some piece of your code, holding some set of credentials, decided to run it in response to a parsed string. Removing the tool from the model's tool list does not remove the function from your codebase, and adding it does not grant the model anything it did not previously have, which is to say, nothing.⁴

This is why discussions of "agentic safety" that focus on what the model is allowed to do tend to drift. The control surface is not the model. It is the dispatcher in the middle. The model can emit any tool call it likes; what matters is whether the system around it acts on the emission. No function fires because the model named it.

. . .

The Pattern That Makes the Loop Work

Once you see function calling as a clean split of responsibility, the design pattern becomes obvious: the model decides what, the schema describes how, and the code controls whether and when. Each side does the part it is best suited to.⁹

The model is good at reading ambiguous natural language and selecting from a small set of known options. Given a user's request and five tool definitions, picking the right tool is exactly the kind of soft pattern matching transformer architectures excel at. The model is bad at executing multi-step deterministic logic, holding strict invariants over long sequences, or refusing to produce output when the right answer is "I cannot help." Code is good at all of those things.

The schema is the medium that lets the two collaborate. A well-described tool acts on the model the way good API documentation acts on a junior developer: it narrows the choices, names the parameters, signals the intent. A poorly described tool, with vague names and unconstrained parameters, leaves the model to invent. The model will invent confidently, because that is what language models do when underspecified.⁵

The code is the layer where reality lives. Every assertion about what the system actually did, what was logged, what was charged, what was deleted, lives in code your team wrote. This is good news. It means the same engineering disciplines that worked before language models, that is, schema validation, structured logging, retries, rate limits, dead-letter queues, audit trails, work just as well at the boundary between your code and a model. They simply move one layer outward, treating the model's output as untrusted input from a network peer.¹⁰

This split is why text-only models forced a choice between two bad options. You could ask the model to do everything in one shot, hoping the prompt covered every case, or you could ask it to produce instructions you parsed and dispatched yourself. The first option broke on tasks the model lacked information for. The second collapsed prompt engineering into glorified string matching. Function calling provides a third path: the model decides what to do, structured calls describe how, and your code retains control of execution. Each side handles what it does best.

. . .

A Short Inventory of Tool Examples

To make the pattern stick, here are five tools drawn from familiar fictional universes. Each one shows a different shape of work the model is good at delegating and code is good at performing.

Tool	What the model does	What the code does
Customer intent classifier	Reads the customer's free-text message, picks the classifier tool, fills in `text` with the relevant excerpt.	Runs the classifier model, returns a label and confidence score, logs the inference for later evaluation.
Hitchhiker's Guide entry lookup	Selects the lookup tool, supplies the topic the user asked about.	Queries the Guide index, paginates results, sanitizes any entries flagged "Mostly Harmless" before returning them.
Anomaly flagger	Decides the input is a candidate for review, packages the parameters.	Runs the anomaly model, enforces score bounds, alerts a human operator on confirmed flags.
Flesh wound severity tool	Reads the casualty description, classifies the wound type, fills in the severity request.	Validates the description against medical vocabulary, returns a triage code, refuses to honor any request to label a severed limb as superficial.
Gandalf advice retrieval	Picks the advice tool, supplies the situation summary.	Retrieves the relevant passage, attaches a citation, declines to invent quotations not present in the source corpus.

The model picks the tool and shapes the arguments; the code does everything that touches the world.

Notice that in each row the right column is where the engineering effort lives. The model's contribution is to look at a messy human request and figure out which tool, with which arguments, fits. That is genuinely useful, because it is exactly the kind of judgment that is brittle to write as code. The rest of the work, that is, validation, execution, refusal, logging, citation, is exactly the kind of work that is brittle to delegate to a model. Function calling does not eliminate engineering. It rearranges it so that the model handles the linguistic layer and code handles the operational layer.⁶

. . .

Why This Matters for the Rest of the Course

This is the joint where the rest of the course clicks together. Once you can move from text to structured calls, the layers that come next stop being a sequence of disconnected techniques and start looking like a single architecture.

Retrieval-augmented generation, the topic of Weeks 5 and 6, is function calling applied to the model's own knowledge limitations. A retrieval tool is just another entry in the tool list: it takes a query, returns documents, and the same four-move cycle runs. Everything you learn this week about schemas, validation, and failure handling carries forward, because retrieval is one specific load-bearing tool among many.

Fine-tuning, the topic of Week 7, is what you do when the model's tool selection or argument generation is not reliable enough for your domain and prompt engineering has hit its limit. The decision to fine-tune almost always starts with logs of tool-call traffic showing where the prompted model gets it wrong. Without a function-calling architecture, you do not have those logs in a structured form. With one, every move-two emission is a labeled training example waiting to be collected.⁷

Evaluation and the integrated systems work in Week 8 ride on the same foundation. A system that emits structured calls is a system you can evaluate per-tool: how often does the model pick the right tool, fill in the right arguments, recover from a tool error? Those questions are answerable with ordinary metrics because the unit of analysis is no longer free-form text but a typed object. The agentic systems we build later in the course assume this baseline. Without it, "evaluating an agent" reduces to reading transcripts and arguing about whether they are good.

. . .

For Practitioners

If you take only a handful of working principles out of this article, take these.

1 / 6

Validation

Treat tool calls as untrusted input. The model emits JSON; your runtime validates it. Schema is documentation for the model, not a contract you can rely on. Validate types, ranges, enums, and required fields before any function executes.

Dispatcher

Own the dispatcher. The code that routes a parsed tool call to a function is the security boundary, the audit point, and the rate limiter. Write it yourself, log every dispatch, and make it the place where authorization decisions live.

Tool descriptions

Describe tools with the same care you would a public API. Vague names and underspecified parameters force the model to invent. Verb-noun function names, precise parameter descriptions, units in the description text, and tight enums turn most reliability problems into the model's strength rather than its weakness.

Move-two logging

Log the full move-two output. The structured call the model emitted, the arguments before validation, the validation outcome, the function result, and the next-turn input. Five fields per call: the dataset you will need for evaluation, debugging, and any future fine-tuning effort.

Refusal

Refuse where the model cannot. The model has no robust way to refuse a tool call. Build refusals at the dispatcher: rate limits, scope checks, principal-of-least-privilege credentials, and explicit allowlists for irreversible operations.

Vocabulary

Stop using "the model called it" in design conversations. The model emitted, your code dispatched, the function executed. The change in vocabulary changes which engineer is responsible for fixing what.

Week 4's remaining articles drill into the specifics of these principles. The mechanics article walks through actual provider request-response shapes. The schemas article covers the discipline of describing tools the model can use reliably. The loops article handles the multi-step and parallel cases. The security article covers what happens when the boundary between model output and function execution is the place an attacker tries to live. All of them assume the framing in this piece. Function calling is not a capability the model gained. It is a pattern that lets a probabilistic system collaborate with a deterministic one without either pretending to be the other.⁸

. . .

References

OpenAI. (2023). "Function calling." OpenAI Platform Documentation. The vendor-side framing, useful as a foil for the alternative reading offered here.
Anthropic. (2024). "Tool use with Claude." Anthropic Documentation. Notes the request-response loop and the runtime's role in execution.
JSON Schema. (2020). "JSON Schema Specification." json-schema.org. The actual contract format. Worth reading once to demystify what the schema is and is not.
Greshake, K., et al. (2023). "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv. The clearest paper on why the security boundary lives in the dispatcher.
Patil, S., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv. Empirical results on how schema and description quality affect tool-call reliability.
Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv. Early work establishing that tool use is a learned token-emission pattern, not an executive capability.
Qin, Y., et al. (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv. Shows how tool-call traces become training data once the system is structured around emit-then-execute.
Anthropic. (2024). "Building Effective Agents." Anthropic Research. Argues for keeping the orchestration layer simple and putting the heavy lifting in tools and code rather than the model.
Yao, S., et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv. Frames tool use as interleaved reasoning and action, with the action layer external to the model.
Huyen, C. (2024). "Building LLM applications for production." Engineering blog. Practical commentary on why production reliability comes from the code wrapping the model, not the model itself.

Function Calling Tool Use Schema Design Dispatcher Pattern Agentic Loop Software Engineering