Sources

Grounding, citations, and further reading for Schemas That Models Can Follow.

All of this is optional. These are the sources behind the article; nothing on this page is required reading, and you do not need to purchase any of the books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper knows where to look.

References

1. OpenAI

OpenAI. (2023). "Function calling." OpenAI API Documentation.

2. Anthropic

Anthropic. (2024). "Tool use." Anthropic Documentation.

3. JSON Schema

JSON Schema. (2020). "JSON Schema Specification." json-schema.org.

4. Patil, S.

Patil, S., et al. (2023). "Gorilla: Large Language Model Connected with Massive APIs." arXiv.

5. Qin, Y.

Qin, Y., et al. (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv.

Introduction

6. Grounding note

Jurafsky & Martin (SLP3, Ch. 7, Section 7.3) establish the formal basis for why schemas matter. They define prompting as providing a context x that conditions the model's generation of y via P(y|x). The tool description and JSON Schema together constitute this conditioning context: the description conditions the model's decision about whether to call the tool, while the schema conditions the structure of the generated arguments. Both operate on the same autoregressive mechanism, so the quality of both directly determines the quality of the output.
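As a rough illustration of the two conditioning signals, a tool definition might look like the sketch below. The field names (`name`, `description`, `parameters`) and the `order_id` parameter are assumptions chosen for illustration; each provider's documentation defines its own exact shape.

```python
# Hypothetical, provider-neutral tool definition (field names are assumptions).
get_order_status_tool = {
    "name": "get_order_status",
    # The description conditions WHETHER the model calls the tool.
    "description": (
        "Look up the current status of a single order by its order ID. "
        "Use this when the user asks where an order is or whether it has shipped."
    ),
    # The JSON Schema conditions the STRUCTURE of the generated arguments.
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The order identifier exactly as shown to the customer.",
            },
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}


def required_parameters(tool):
    """Return the argument names the schema marks as required."""
    return list(tool["parameters"].get("required", []))
```

Both parts end up in the same context window, so both are part of the context x in P(y|x).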

7. Grounding note

Widdows & Cohen provide useful context here. In Ch. 5 (Section 5.2.3), they describe how a model trained purely for next-token prediction can be converted into one that follows instructions with remarkably little additional training data. The description in a tool schema functions as exactly this kind of instruction: it activates the model's learned instruction-following behavior to select and invoke the right tool. Widdows & Cohen, Issue #45

The Description Is a Prompt

8. Grounding note

The textbook notes that output format enforcement can catch syntax errors as they're generated and regenerate tokens until valid output is produced. Tool schemas are the declarative version of this: you define the valid structure up front, and the model's generation is constrained to match. See GH #3, Ch. 5.

9. Grounding note

Jurafsky & Martin (SLP3, Ch. 7, Section 7.3) formalize this insight. They describe how prompts function as demonstrations that condition the model's generation: "By simply adding more demonstrations, even of a different task, the model infers the task from the context." A tool description is a zero-shot prompt, a single instruction that must convey the full task without any examples. The book notes that the quality of the prompt text directly determines model performance, and that "the exact wording of the task" matters. This is why "Search for things" fails: it provides insufficient conditioning signal for the model to disambiguate among possible behaviors.

10. Grounding note

Widdows & Cohen discuss a parallel concept in Ch. 2 (Section 2.3.1) on statistical term weighting. Karen Spärck Jones showed that specificity in search terms improves retrieval: rarer, more specific terms are more valuable for identifying relevant documents. Tool descriptions work the same way: "Search the product catalog by keyword" is highly specific and disambiguating, while "Search for things" is the equivalent of a high-frequency stopword that carries almost no useful signal. Widdows & Cohen, Issue #45
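The term-weighting idea can be sketched with a plain inverse document frequency calculation. This is a minimal illustration, not the book's formulation: the four tool descriptions are made up, and the `log(N / df)` form is one common variant of IDF.

```python
import math


def idf(term, documents):
    """Inverse document frequency: log(N / df), df = documents containing the term."""
    df = sum(1 for doc in documents if term in doc.lower().split())
    if df == 0:
        return 0.0  # Convention for this sketch: an unseen term gets no weight.
    return math.log(len(documents) / df)


# Hypothetical tool descriptions standing in for a document collection.
descriptions = [
    "search the product catalog by keyword",
    "search for things",
    "search order history by customer id",
    "get things done for the user",
]

weights = {term: idf(term, descriptions) for term in ["search", "things", "catalog"]}
```

In this toy collection "catalog" appears in one description, "things" in two, and "search" in three, so the rarest term receives the highest weight and the most common the lowest.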

Constrain Everything

11. Grounding note

Alammar & Grootendorst describe output verification techniques that constrain model output via examples and grammar rules, ensuring the generated text conforms to expected structures. JSON Schema constraints on tool parameters are the declarative equivalent: rather than post-hoc verification, you define the valid output space upfront so the model generates within bounds. See GH #5, Ch. 6.

12. Grounding note

Jurafsky & Martin (SLP3, Ch. 7, Section 7.4) ground this in the mechanics of generation. At each token position, the model computes a softmax distribution over the entire vocabulary (Eq. 7.1), which can contain 50,000+ tokens. An unconstrained string parameter means the model samples from this full distribution for each token of the value. Every constraint you add (type restrictions, enums, format patterns) effectively masks portions of that distribution, reducing the effective vocabulary to the valid output space. Fewer valid options means more probability mass on correct outputs. This is why constraints work: they do not change the model's capabilities; they narrow the space in which those capabilities operate.

13. Grounding note

Jurafsky & Martin (SLP3, Ch. 3, Section 3.5) discuss how models generalize from training data and how unseen sequences create uncertainty. A model that has seen "cancelled" (British) and "canceled" (American) in training will assign probability to both spellings. An enum is the formal solution to this generalization problem: rather than hoping the model selects the spelling your API expects, you reduce the valid output to an explicit set, eliminating the entire space of near-synonyms and spelling variants. A useful contrast is the Laplace smoothing in Ch. 3, Section 3.6 (Eq. 3.24), which spreads probability mass out onto unseen events; enums do the opposite and concentrate all mass onto the valid options.
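A minimal sketch of the enum idea, using only the standard library. The `check_enum` helper is hypothetical and covers just the `enum` keyword; a real system would normally use a full JSON Schema validator.

```python
# Schema fragment for a status parameter; the API is assumed to expect
# the British spelling "cancelled".
STATUS_SCHEMA = {
    "type": "string",
    "enum": ["active", "paused", "cancelled"],
}


def check_enum(value, schema):
    """Return None if the value is in the schema's enum, else an error message."""
    allowed = schema["enum"]
    if value in allowed:
        return None
    return f"{value!r} is not one of {allowed}"


ok = check_enum("cancelled", STATUS_SCHEMA)      # accepted spelling
rejected = check_enum("canceled", STATUS_SCHEMA)  # plausible variant, not in the set
```

The American spelling is a perfectly reasonable generation, but it falls outside the explicit set, so the check rejects it instead of passing it to the API.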

14. Grounding note

Widdows & Cohen explain the underlying mechanism that makes enums so effective. In Ch. 5 (Section 5.2.2), they describe how next-token prediction works: the model computes a logit for every token in the vocabulary via scalar products, then applies softmax to produce probabilities. An enum constrains this distribution at the schema level: when the schema is enforced during decoding, every token that cannot continue a valid option is masked out at each generation step (its logit is effectively set to negative infinity, so softmax assigns it zero probability). This is why enums are so reliable: they reduce the output space from the full vocabulary to a handful of values. Widdows & Cohen, Issue #45
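The masking step can be sketched in a few lines. This is a simplified illustration under loud assumptions: the five-entry "vocabulary" and its logits are invented, and each enum value is treated as a single token, whereas real tokenizers split values into subword pieces and masking is applied at every step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def mask_logits(logits, allowed_indices):
    """Set every disallowed position to -inf so it receives zero probability."""
    return [x if i in allowed_indices else float("-inf") for i, x in enumerate(logits)]


# Toy vocabulary and logits (assumed values, for illustration only).
vocab = ["active", "paused", "cancelled", "canceled", "stopped"]
logits = [1.0, 0.5, 2.0, 2.0, 0.2]
allowed = {0, 1, 2}  # indices of the enum values

unconstrained = softmax(logits)
constrained = softmax(mask_logits(logits, allowed))
```

After masking, "canceled" and "stopped" have probability zero and the mass they held is redistributed over the three allowed values, so "cancelled" becomes more likely than it was in the unconstrained distribution.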

15. Grounding note

Jurafsky & Martin (SLP3, Ch. 7, Section 7.3) describe in-context learning as the model's ability to infer task requirements from examples or instructions within the prompt. Parameter descriptions function as in-context learning signals: the format string "YYYY-MM-DD/YYYY-MM-DD" acts as an implicit one-shot example that conditions the model's generation toward that specific pattern. Without this signal, the model falls back on its pretraining distribution, which contains many valid date formats. The more specific the description, the stronger the conditioning signal, and the narrower the output distribution around the intended format.
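On the receiving side, the same format string can be checked explicitly. The sketch below assumes the parameter is a date range written as "YYYY-MM-DD/YYYY-MM-DD"; the function name and the rule that the start must not follow the end are illustrative choices, not something the article specifies.

```python
import re
from datetime import date

# Two ISO-style dates separated by a slash.
DATE_RANGE = re.compile(r"(\d{4}-\d{2}-\d{2})/(\d{4}-\d{2}-\d{2})")


def parse_date_range(value):
    """Parse 'YYYY-MM-DD/YYYY-MM-DD' into (start, end) or raise ValueError."""
    match = DATE_RANGE.fullmatch(value)
    if match is None:
        raise ValueError(f"expected YYYY-MM-DD/YYYY-MM-DD, got {value!r}")
    # fromisoformat also rejects impossible dates such as month 13.
    start, end = (date.fromisoformat(part) for part in match.groups())
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end


january = parse_date_range("2024-01-01/2024-01-31")
```

A value such as "Jan 1 to Jan 31" is a valid way to express a range in natural language, which is exactly why the description has to name the one format the tool accepts.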

The Nesting Problem

16. Grounding note

Jurafsky & Martin (SLP3, Ch. 3, Section 3.1) formalize why nesting is hard for autoregressive models. The chain rule of probability (Eq. 3.3-3.4) decomposes P(w₁...wₙ) into a product of conditional probabilities. Each additional level of nesting increases the number of structural tokens (braces, brackets, commas) the model must generate correctly, and each structural token depends on all preceding context. Errors compound multiplicatively: if each structural token has a 99% success rate and errors are treated as independent, four levels of nesting with ~20 structural tokens give roughly 0.99²⁰ ≈ 82% overall success. Flattening the schema reduces the structural token count and limits the compounding.
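The arithmetic can be checked directly. The sketch assumes independent errors with a fixed per-token success rate, which is a simplification; the figure of 8 structural tokens for a flattened schema is a made-up comparison point.

```python
def sequence_success(per_token_success, structural_tokens):
    """Probability that every structural token is correct, assuming independence."""
    return per_token_success ** structural_tokens


nested = sequence_success(0.99, 20)  # four levels of nesting, ~20 structural tokens
flat = sequence_success(0.99, 8)     # hypothetical flattened schema

print(f"nested: {nested:.1%}, flat: {flat:.1%}")
```

Even with an optimistic 99% per-token rate, the nested case loses roughly one call in five to a structural error, while the shorter flat case loses noticeably fewer.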

17. Grounding note

Widdows & Cohen discuss an analogous scaling problem in Ch. 4. They note that for n-gram language models, complexity "blows up exponentially" with sequence length (proportional to k^N). Attention was introduced partly to handle long-range dependencies more efficiently, scaling quadratically instead. Deeply nested JSON schemas present a similar challenge: each additional level of nesting compounds the structural dependencies the model must track during autoregressive generation, leading to the error accumulation described here. Widdows & Cohen, Issue #45
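One rough way to see the structural burden is to count structural characters in the serialized arguments. This is only a proxy: it counts characters rather than model tokens, it assumes keys and values contain no braces, brackets, commas, or colons, and the order/address example is invented.

```python
import json

STRUCTURAL = set("{}[],:")


def structural_count(arguments):
    """Count structural JSON characters in the serialized arguments (a rough proxy)."""
    return sum(1 for ch in json.dumps(arguments) if ch in STRUCTURAL)


# Hypothetical arguments: the same value, nested four levels deep versus flattened.
nested_args = {"order": {"customer": {"address": {"city": "Oslo"}}}}
flat_args = {"order_customer_address_city": "Oslo"}

nested_structural = structural_count(nested_args)
flat_structural = structural_count(flat_args)
```

The nested form needs four opening braces, four closing braces, and four colons; the flat form needs one pair of braces and one colon, so there are far fewer positions at which generation can go structurally wrong.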

Validation as Architecture

18. Grounding note

Jurafsky & Martin (SLP3, Ch. 3, Section 3.3) define perplexity (Eq. 3.14-3.17) as a measure of how surprised the model is by a sequence: lower perplexity means the model found the sequence more predictable. The retry pattern works because the error message dramatically reduces perplexity on the second attempt: "Expected schema: {type: string, enum: [active, paused, cancelled]}" provides explicit conditioning that makes the correct output overwhelmingly probable. The first attempt has high perplexity over the valid values; the second attempt, conditioned on the error feedback, has perplexity close to 1, the minimum possible value.
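A small numeric sketch of the perplexity argument, using the standard definition as the exponential of the average negative log probability. The probabilities are invented for illustration: a first attempt that is uniform over three candidate values, and a second attempt that puts 0.98 on the correct one.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log probability assigned to the observed tokens."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


first_attempt = perplexity([1 / 3])   # uniform over three candidate values
second_attempt = perplexity([0.98])   # after error feedback (assumed probability)
certain = perplexity([1.0])           # a fully predictable token
```

A uniform choice among three values has perplexity 3, and a fully predictable token has perplexity 1, which is the floor; the retry moves the model from the first regime toward the second.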

19. Grounding note

Widdows & Cohen provide theoretical grounding for why this retry pattern works. In Ch. 6 (Section 6.1.1), they explain that LLMs generate text that is plausible rather than factually grounded, and cite Turing's observation that inductive methods will "lead occasionally to erroneous results." The retry-with-error-feedback pattern accepts this as a design constraint rather than fighting it: the first attempt is the model's probabilistic best guess, and the validation error provides the structured signal needed to narrow the output distribution on the second attempt. Widdows & Cohen, Issue #45
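The retry-with-error-feedback pattern can be sketched as a small loop. Everything here is hypothetical scaffolding: `generate` stands in for a model call, `fake_model` simulates a model that corrects itself after feedback, and the two-attempt limit is an arbitrary choice.

```python
ALLOWED = ["active", "paused", "cancelled"]


def validate_status(arguments):
    """Return None when valid, otherwise an error message to feed back to the model."""
    status = arguments.get("status")
    if status in ALLOWED:
        return None
    return f"status {status!r} is not one of {ALLOWED}"


def call_with_retry(generate, validate, max_attempts=2):
    """Ask for arguments, validate them, and retry with the error as feedback."""
    feedback = None
    error = None
    for attempt in range(1, max_attempts + 1):
        arguments = generate(feedback)
        error = validate(arguments)
        if error is None:
            return arguments, attempt
        feedback = f"Validation failed: {error}"
    raise ValueError(f"no valid arguments after {max_attempts} attempts: {error}")


def fake_model(feedback):
    """Stand-in for a model: wrong spelling first, corrected once it sees feedback."""
    if feedback is None:
        return {"status": "canceled"}
    return {"status": "cancelled"}


result, attempts = call_with_retry(fake_model, validate_status)
```

The loop treats the first answer as a best guess and the validation error as the structured signal for the second; bounding the number of attempts keeps a persistently wrong model from looping forever.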

Designing for the Model, Not for the Developer

20. Grounding note

Alammar & Grootendorst describe chains as modular components connected together, with prompt templates structuring LLM interactions. Tool schemas serve the same architectural role: they are the contract between the chain orchestrator and the tool, and like prompt templates, they must be designed for the model's comprehension rather than the developer's convenience. See GH #5, Ch. 7.

21. Grounding note

Widdows & Cohen reinforce this point in Ch. 5 (Section 5.2.3). They show that instruction-tuned models learn to "conform to the rhetorical structure of a correct answer" from training examples, not from reading documentation about how to answer. The prompts in the Alpaca training set were deliberately constructed to follow patterns the model could generalize from. Tool schemas must work the same way: they must encode their intent in the structure itself, because the model will pattern-match against them, not reason about external docs. Widdows & Cohen, Issue #45

22. Grounding note

Widdows & Cohen describe the traditional architectural separation between a knowledge base (which stores structured facts) and a language model (which turns facts into fluent text) in Ch. 6 (Section 6.1.1, Figure 6.2). Tool schemas revive this principle in the agentic era: the schema is the structured contract that grounds the model's generation, preventing it from hallucinating argument values the same way a knowledge base prevents it from hallucinating facts. Explicit constraints in the schema play the role that structured data once played in traditional systems. Widdows & Cohen, Issue #45

23. Grounding note

Jurafsky & Martin (SLP3, Ch. 7, Section 7.6) describe the evaluation problem that makes "boring" schemas valuable. They define perplexity-based evaluation and note the risk of data contamination, where test data leaks into training sets. "Boring" schemas with common, well-established naming patterns (e.g., get_order_status) are precisely the patterns the model has seen most frequently during instruction tuning (Section 7.5.1). The model has high confidence on familiar structures and low confidence on novel ones. Designing schemas that match the model's training distribution is not laziness; it is an engineering decision that maximizes the probability of correct output.