Sources
Grounding, citations, and further reading for Function Calling Across Providers.
All of this is optional. These are the canonical provider docs and benchmarks consulted while writing the article, shown here so the wire-format claims can be verified against the source. Nothing on this page is required reading.
The article is self-contained. This page exists so each numbered reference is properly cited and so that anyone who wants to read the full provider documentation knows where to look.
About the Sources
Provider documentation (OpenAI, Anthropic, Google, AWS, Ollama, vLLM)
The provider docs are the only authoritative source for the wire-format claims in the article: parameter names, schema location, response shape, argument typing, and reliability flags. When the article says OpenAI returns tool_calls[] on the assistant message and Anthropic returns tool_use content blocks, those claims trace directly to the docs cited below. The shapes here are accurate as of late 2024 / early 2025; expect drift on the order of months as each provider iterates.
Berkeley Function Calling Leaderboard (BFCL)
The reliability column in the side-by-side table draws from BFCL scores. The benchmark evaluates models across simple tool calls, multiple-tool calls, parallel calls, function relevance, and live executable scenarios. It is the most-cited public benchmark for tool-use reliability and is regularly updated as new models ship.
Wrapper library documentation (LiteLLM)
LiteLLM's docs are the reference for what the cross-provider wrapper layer actually does and does not pass through. The "what leaks through" observations in the article (provider-specific errors, dropped parameters, version drift) are documented behaviors, not speculation.
The Wire Format Per Provider
1OpenAI: top-level tools, schema under function.parameters
The OpenAI guide documents the full request and response shape for tool-enabled chat completions. Tool definitions are an array of { type: "function", function: { name, description, parameters } } objects. The model's structured call comes back as a tool_calls[] array on the assistant message, where arguments is a stringified JSON object that the runtime must parse before validation. Multiple parallel calls in the same response are the default behavior.
OpenAI Platform: Function Calling guide. ↩ Back to article
2OpenAI: strict: true structured outputs
OpenAI's structured-outputs mode applies token-level constraints during decoding so that the model can only emit JSON that conforms to a supplied schema. Setting strict: true on a tool definition raises that tool's reliability close to 100% syntactically, at the cost of some flexibility (additional fields are rejected, optional fields are not really optional in the same way). The article's claim that tool-call reliability is "high" for OpenAI assumes strict mode is available.
OpenAI Platform: Structured Outputs guide. ↩ Back to article
3Anthropic: top-level tools, schema under input_schema, tool_use blocks
Anthropic's tool-use docs detail the request and response shape for Claude. Tool definitions live at tools[].input_schema rather than under a function.parameters nesting, and tool calls return as tool_use content blocks inside the assistant message's content[] array, with arguments already parsed as an object (no JSON.parse needed). Tool results are returned via a user-role message containing a tool_result content block with the matching tool_use_id; there is no separate tool role. The strict alternation of user and assistant roles makes some logging and replay simpler than the OpenAI shape.
Anthropic API: Tool Use with Claude. ↩ Back to article
4Gemini: function_declarations nested in tools[], OpenAPI dialect
The Gemini API wraps function declarations inside a tools[].function_declarations[] structure and uses the OpenAPI 3.0 schema dialect rather than vanilla JSON Schema, which means a few field names and capitalizations differ. Tool calls come back as functionCall objects inside the response's parts[] arrays. The assistant role is named model rather than assistant, which is a small but persistent source of bugs when porting code between providers.
Google AI for Developers: Function Calling with the Gemini API. ↩ Back to article
5AWS Bedrock Converse: toolConfig envelope, toolUse blocks
The Converse API normalizes function calling across the model families Bedrock hosts (Claude, Llama, Mistral, Cohere, Titan). Tool definitions live at toolConfig.tools[].toolSpec with the JSON Schema wrapped in an inputSchema.json envelope. Tool calls come back as toolUse content blocks. The wrapping is what makes Bedrock useful for multi-model deployments: you swap the modelId string and the request and response shape stay constant. The reliability of the underlying model is unchanged by the wrapper, which is why the article rates Bedrock reliability as inheriting from the underlying model.
Bedrock User Guide: Use the Converse API with Tool Use. ↩ Back to article
6Ollama: OpenAI-compatible Chat Completions endpoint
Ollama exposes an OpenAI-compatible Chat Completions endpoint at http://localhost:11434/v1. The wire format is identical to OpenAI's, which means existing OpenAI SDK code can target a local Llama or Qwen model with only the model name and base URL changed. The compatibility is wire-only; the underlying model's tool-use behavior is whatever that model's training produced. The article's reliability column for Ollama (variable, model-dependent) traces directly to this point.
Ollama: OpenAI Compatibility. ↩ Back to article
7vLLM: guided decoding and grammar-constrained tool calls
vLLM's tool-calling docs cover the OpenAI-compatible endpoint and the guided-decoding features (Outlines, JSON Schema, CFG grammars) that constrain output at the token level. The article's claim that constrained decoding fixes the syntax of a tool call but cannot fix the semantics is the central operational lesson here: a model that does not understand when to call the tool will still call it at the wrong time, just with valid JSON.
vLLM: Tool Calling. ↩ Back to article
Reliability and Wrappers
8Berkeley Function Calling Leaderboard
BFCL is the standard public benchmark for cross-model tool-use reliability. It evaluates simple calls, multiple tools, parallel calls, function relevance (correctly choosing not to call), and live executable scenarios. The article's reliability ratings track BFCL's general findings: provider-native frontier models score in the high 90s on simple cases, open-weights models trail by 5 to 15 percentage points, and the gap narrows on each major open-weights release. The 5% reliability gap mentioned in the article corresponds to BFCL's "live" and "irrelevance" categories, which are the failure modes users actually see.
Berkeley AI Research: Berkeley Function Calling Leaderboard. ↩ Back to article
9LiteLLM: cross-provider function-calling shim
LiteLLM's docs describe how the wrapper accepts the OpenAI request shape and translates it to whatever provider you target, returning the OpenAI response shape regardless of where the call actually went. The article's "what leaks through" observations (provider-specific error codes, rate-limit shapes, dropped parameters like Anthropic's prompt caching or Gemini's grounding) are documented limitations, not edge cases. The 80/20 framing in the article (LiteLLM removes 80% of the boilerplate and concentrates the remaining 20% into failure moments) is consistent with the docs' own positioning.