Function Calling Across Providers

The four-move cycle is invariant across providers; the wire format is not. A side-by-side reference for the same tool defined five different ways, with an honest accounting of where the differences leak through the wrappers that promise to hide them.

. . .

The Cycle Is Invariant; the Wire Format Is Not

Every major provider implements function calling using the same four-move cycle. You send a prompt with a list of tool definitions. The model decides whether to answer in text or to emit a structured call. Your runtime executes the call and returns the result. The model produces its final response or another tool call. That cycle is the protocol, and once you have implemented it against one provider, you have done the conceptual work for all of them.
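The four moves can be sketched without any provider at all. The loop below is a minimal illustration, not any vendor's API: fake_model, the add tool, and the message dictionaries are all hypothetical stand-ins chosen so the example runs on its own.

```python
import json
from typing import Callable

# Hypothetical tool registry: tool name -> local Python callable.
TOOLS: dict[str, Callable[..., object]] = {
    "add": lambda a, b: a + b,
}

def fake_model(messages: list[dict]) -> dict:
    """Stand-in for a provider call: asks for the tool once, then answers in text."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "text", "text": "The answer is " + messages[-1]["content"]}

def run_cycle(prompt: str, model: Callable[[list[dict]], dict] = fake_model,
              max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": prompt}]       # move 1: prompt plus tool list
    for _ in range(max_turns):
        reply = model(messages)                             # move 2: text or structured call
        if reply["type"] == "text":
            return reply["text"]                            # move 4: final response
        result = TOOLS[reply["name"]](**reply["args"])      # move 3: runtime executes the call
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("model kept requesting tools past max_turns")

print(run_cycle("What is 2 + 3?"))  # prints: The answer is 5
```

The max_turns cap is a design choice of this sketch: because move four can be another tool call, an unbounded loop is the easiest bug to ship.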

What is not invariant is the wire format. OpenAI calls the parameter tools and expects a JSON Schema under function.parameters. Anthropic also calls the parameter tools but puts the schema under input_schema and returns calls as tool_use content blocks rather than a separate tool_calls field. Google Gemini wraps everything in function_declarations nested inside a tools array. AWS Bedrock provides its own toolConfig envelope through the Converse API, normalizing across the model families it hosts. Open-weights models served through Ollama and vLLM expose an OpenAI-compatible endpoint that approximates the contract with variable reliability depending on the underlying model.

The mainstream framing is that you should pick a wrapper library and ignore the differences. That works most of the time. The engineering reality is that the differences leak through anyway when something fails: a 400 error with a provider-specific shape, a tool argument that one model parses correctly and another mangles, a streaming chunk format that does not match what your wrapper expects. This article is the reference you reach for when the abstraction stops abstracting.

. . .

The Four Moves, Five Ways

The running example throughout this article is a single tool. A user asks the model a question about something obscure: the dietary preferences of Bugblatter Beasts, the reason mice are actually hyperintelligent pan-dimensional beings, the appropriate towel-related etiquette for interstellar hitchhikers. The model needs to look the answer up. So it calls a function:

lookup_hitchhikers_guide_entry(topic: str) -> str

One required string parameter, one return value: the simplest possible non-trivial tool. Watch what happens to that definition as it travels across providers.

OpenAI Chat Completions

OpenAI uses a top-level tools array, with each entry tagged type: "function". The schema lives under function.parameters as standard JSON Schema. When the model decides to call the tool, it returns a tool_calls array on the assistant message, with the arguments as a JSON-encoded string (not a parsed object).1

// Request payload (truncated)
{
  "model": "gpt-4o",
  "messages": [
    { "role": "user", "content": "What does the Guide say about towels?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "lookup_hitchhikers_guide_entry",
        "description": "Retrieve the entry for a given topic from the Hitchhiker's Guide to the Galaxy.",
        "parameters": {
          "type": "object",
          "properties": {
            "topic": {
              "type": "string",
              "description": "The subject to look up, such as 'towel' or 'Vogon poetry'."
            }
          },
          "required": ["topic"]
        }
      }
    }
  ]
}

// Assistant response when the model decides to call
{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "lookup_hitchhikers_guide_entry",
        "arguments": "{\"topic\":\"towel\"}"
      }
    }
  ]
}

You return the result as a message with role: "tool" and the matching tool_call_id. Because arguments is a stringified JSON object, you must parse it yourself before validating it. When reliability matters more than flexibility, the same definition can opt in to OpenAI's strict structured-output mode by setting strict: true on the function and additionalProperties: false on the schema.2
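A minimal sketch of that return trip, working on plain dictionaries shaped like the response above rather than on SDK objects. The helper name handle_openai_tool_calls and the registry argument are assumptions for illustration; only the field names come from the payloads shown.

```python
import json

def handle_openai_tool_calls(assistant_message: dict, registry: dict) -> list[dict]:
    """Turn each entry in tool_calls into a role:"tool" reply message."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        try:
            args = json.loads(fn["arguments"])      # arguments arrive as a JSON string
            content = registry[fn["name"]](**args)  # look up and run the local function
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Report the failure back to the model instead of crashing the loop.
            content = f"error: {exc!r}"
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],             # must match the id the model issued
            "content": content,
        })
    return replies
```

Returning the error as tool content is one possible policy; it gives the model a chance to correct a malformed call on the next turn.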

Anthropic Messages API

Anthropic also uses a top-level tools array, but the schema lives directly under input_schema rather than under a function.parameters nesting. The bigger structural difference is on the response side. Anthropic returns content as an array of typed blocks, and a tool call is one of those blocks, with the arguments already parsed as an object.3

// Request payload (truncated)
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "lookup_hitchhikers_guide_entry",
      "description": "Retrieve the entry for a given topic from the Hitchhiker's Guide to the Galaxy.",
      "input_schema": {
        "type": "object",
        "properties": {
          "topic": {
            "type": "string",
            "description": "The subject to look up, such as 'towel' or 'Vogon poetry'."
          }
        },
        "required": ["topic"]
      }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the Guide say about towels?" }
  ]
}

// Assistant response when the model decides to call
{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01A09q90qw90lq917835lq9",
      "name": "lookup_hitchhikers_guide_entry",
      "input": { "topic": "towel" }
    }
  ],
  "stop_reason": "tool_use"
}

The result is returned as a user message containing a tool_result content block with the matching tool_use_id. There is no separate "tool" role; tool results travel inside user-role messages. This is a small change with a real consequence: the entire conversation is structured as a strict alternation of user and assistant messages, which makes some logging and replay code simpler than the OpenAI shape.
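The same return trip in the Anthropic shape might look like the sketch below. It operates on a dictionary shaped like the response above; anthropic_tool_results and registry are hypothetical names, and real tool output may need to be serialized to a string first.

```python
def anthropic_tool_results(assistant_message: dict, registry: dict) -> dict:
    """Build the single user message that answers every tool_use block."""
    results = []
    for block in assistant_message["content"]:
        if block.get("type") != "tool_use":
            continue  # text blocks can sit alongside tool_use blocks
        output = registry[block["name"]](**block["input"])  # input is already a dict
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # ties the result to the originating call
            "content": output,
        })
    return {"role": "user", "content": results}
```

Note that no json.loads step appears here: the arguments arrive parsed, which is the practical difference from the OpenAI shape.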

Google Gemini

Gemini also wraps function declarations inside a tools array, but each entry holds a function_declarations list rather than typed entries. Gemini's parameter schema follows a subset of the OpenAPI 3.0 schema dialect rather than vanilla JSON Schema, which means a few details differ (type values are written in uppercase, for example). The response shape places functionCall objects inside parts arrays under each candidate.4

// Request payload (truncated)
{
  "contents": [
    {
      "role": "user",
      "parts": [{ "text": "What does the Guide say about towels?" }]
    }
  ],
  "tools": [
    {
      "function_declarations": [
        {
          "name": "lookup_hitchhikers_guide_entry",
          "description": "Retrieve the entry for a given topic from the Hitchhiker's Guide to the Galaxy.",
          "parameters": {
            "type": "OBJECT",
            "properties": {
              "topic": {
                "type": "STRING",
                "description": "The subject to look up, such as 'towel' or 'Vogon poetry'."
              }
            },
            "required": ["topic"]
          }
        }
      ]
    }
  ]
}

// Candidate response when the model decides to call
{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "functionCall": {
              "name": "lookup_hitchhikers_guide_entry",
              "args": { "topic": "towel" }
            }
          }
        ]
      },
      "finishReason": "STOP"
    }
  ]
}

The result is returned as a functionResponse part on a subsequent user-role content block. Gemini's role for the assistant is model rather than assistant, which is another small naming variation that consistently trips people up when they paste code between providers.
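The uppercase type values mean a schema written for another provider needs a small conversion before it matches the Gemini example above. A minimal sketch, assuming the type keyword is the only thing that needs changing; to_gemini_schema is a hypothetical helper, not part of any SDK, and real schemas may use keywords that Gemini's dialect does not accept.

```python
def to_gemini_schema(schema: dict) -> dict:
    """Recursively copy a JSON Schema, uppercasing every 'type' value."""
    out = {}
    for key, value in schema.items():
        if key == "type" and isinstance(value, str):
            out[key] = value.upper()                  # "object" -> "OBJECT"
        elif key == "properties" and isinstance(value, dict):
            # Recurse into each property's sub-schema; property names are kept as-is.
            out[key] = {name: to_gemini_schema(sub) for name, sub in value.items()}
        elif key == "items" and isinstance(value, dict):
            out[key] = to_gemini_schema(value)        # array element schema
        else:
            out[key] = value                          # description, required, enum, ...
    return out
```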

AWS Bedrock Converse

Bedrock hosts a dozen different model families behind a single API, so its Converse API normalizes function calling across them. The tool definition lives under toolConfig.tools[].toolSpec, with the JSON Schema wrapped in an inputSchema.json envelope. Tool calls come back as toolUse blocks inside a content array, similar in spirit to Anthropic but with different field names.5

// Request payload (truncated)
{
  "modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
  "messages": [
    {
      "role": "user",
      "content": [{ "text": "What does the Guide say about towels?" }]
    }
  ],
  "toolConfig": {
    "tools": [
      {
        "toolSpec": {
          "name": "lookup_hitchhikers_guide_entry",
          "description": "Retrieve the entry for a given topic from the Hitchhiker's Guide to the Galaxy.",
          "inputSchema": {
            "json": {
              "type": "object",
              "properties": {
                "topic": {
                  "type": "string",
                  "description": "The subject to look up, such as 'towel' or 'Vogon poetry'."
                }
              },
              "required": ["topic"]
            }
          }
        }
      }
    ]
  }
}

// Output content when the model decides to call
{
  "output": {
    "message": {
      "role": "assistant",
      "content": [
        {
          "toolUse": {
            "toolUseId": "tooluse_xyz789",
            "name": "lookup_hitchhikers_guide_entry",
            "input": { "topic": "towel" }
          }
        }
      ]
    }
  },
  "stopReason": "tool_use"
}

Bedrock's wrapper hides the underlying model's native shape: whether you point Converse at Claude, Llama, Mistral, Cohere, or Titan, the request and response schemas above stay the same, and the translation happens server-side. This is the most useful thing Bedrock does for tool use, because you can swap the modelId string and rerun without touching the surrounding code, provided the new model supports tool use through Converse at all.
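For completeness, here is a sketch of the follow-up message in the Converse shape: each toolUse block is answered by a toolResult block in a user-role message. The helper bedrock_tool_results and its registry argument are hypothetical; it works on a dictionary shaped like the output above rather than calling AWS.

```python
def bedrock_tool_results(response: dict, registry: dict) -> dict:
    """Answer every toolUse block in a Converse response with a toolResult block."""
    results = []
    for block in response["output"]["message"]["content"]:
        if "toolUse" not in block:
            continue  # plain text blocks carry no call to answer
        use = block["toolUse"]
        try:
            text, status = str(registry[use["name"]](**use["input"])), "success"
        except Exception as exc:  # report tool failures to the model
            text, status = repr(exc), "error"
        results.append({
            "toolResult": {
                "toolUseId": use["toolUseId"],
                "content": [{"text": text}],
                "status": status,
            }
        })
    return {"role": "user", "content": results}
```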

Ollama (OpenAI-Compatible Endpoint)

Ollama exposes an OpenAI-compatible Chat Completions endpoint at http://localhost:11434/v1. You can point an OpenAI SDK at it and the same code that worked against api.openai.com will run against a local Llama or Qwen model. The wire format is identical to the OpenAI example above. The behavior is not.6

// Same payload as the OpenAI example, but with model name and endpoint changed
{
  "model": "llama3.2:latest",
  "messages": [
    { "role": "user", "content": "What does the Guide say about towels?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "lookup_hitchhikers_guide_entry",
        "description": "Retrieve the entry for a given topic from the Hitchhiker's Guide to the Galaxy.",
        "parameters": {
          "type": "object",
          "properties": {
            "topic": { "type": "string" }
          },
          "required": ["topic"]
        }
      }
    }
  ]
}

The compatibility layer accepts the OpenAI shape and either constrains the underlying model to emit a structured call (when the model supports it natively) or applies a prompt template that asks the model to produce a JSON tool call (when it does not). The reliability depends entirely on the underlying model. We will get to why in a moment.

. . .

Side by Side: What Actually Differs

The five formats above implement the same contract, but the names, nesting depths, and reliability vary. The table below collects the differences in one place.

| Provider | Tool param | Schema location | Call response shape | Args type | Parallel calls | Streaming | Structured outputs | Reliability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | tools | function.parameters | tool_calls[] on assistant message | JSON string | Yes (default) | Yes (deltas per call) | Yes, strict: true | High |
| Anthropic | tools | input_schema | tool_use content blocks | Parsed object | Yes (opt-in via prompt) | Yes (input JSON streamed) | Partial via tool schemas | High |
| Google Gemini | tools.function_declarations | parameters (OpenAPI dialect) | functionCall in parts[] | Parsed object | Yes | Yes | Yes, responseSchema | High |
| AWS Bedrock | toolConfig.tools | toolSpec.inputSchema.json | toolUse content blocks | Parsed object | Yes (model-dependent) | Yes (Converse Stream) | Inherits from underlying model | High (Claude, Llama 3.1+); variable for older models |
| Ollama / vLLM | tools (OpenAI-compatible) | function.parameters | tool_calls[] (sometimes) | JSON string (sometimes) | Model-dependent | Partial | Via grammar / outlines, not standardized | Variable; depends on the model |

Five providers implementing one cycle, with the field names, response shapes, and reliability levels that diverge in practice.

The reliability column is the one worth lingering on. The first four providers run models that were trained or post-trained specifically for tool use, on infrastructure that can constrain output at the decoding layer. The last row is a wrapper that may or may not offer any of those guarantees, depending on which model file you pulled.
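The tool-param and schema-location differences are mechanical enough to capture in one function. The sketch below renders a single neutral definition into each envelope shown earlier; render_tool and its provider labels are hypothetical, and the schema is passed through unchanged, so the uppercase type names used in the Gemini example would still need a separate conversion step.

```python
def render_tool(provider: str, name: str, description: str, schema: dict) -> dict:
    """Wrap one tool definition in the envelope a given provider expects."""
    if provider in ("openai", "ollama"):
        return {"type": "function",
                "function": {"name": name, "description": description,
                             "parameters": schema}}
    if provider == "anthropic":
        return {"name": name, "description": description, "input_schema": schema}
    if provider == "gemini":
        # Schema passed as-is; type values are not uppercased in this sketch.
        return {"function_declarations": [
            {"name": name, "description": description, "parameters": schema}]}
    if provider == "bedrock":
        # This entry goes inside toolConfig.tools in the Converse request.
        return {"toolSpec": {"name": name, "description": description,
                             "inputSchema": {"json": schema}}}
    raise ValueError(f"unknown provider: {provider}")

SCHEMA = {"type": "object",
          "properties": {"topic": {"type": "string"}},
          "required": ["topic"]}

for p in ("openai", "anthropic", "gemini", "bedrock"):
    print(p, "->", list(render_tool(p, "lookup_hitchhikers_guide_entry", "Look up a topic.", SCHEMA)))
```

This covers only the request side; the call-response and args-type columns still require per-provider handling when the model answers.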

. . .

Where Open-Weights Falls Short

When you call OpenAI or Anthropic with tools defined, three things happen that you do not see. First, the model was trained or post-trained on a corpus that includes tool-use examples in the exact format the API expects. The model learned that when it sees a tools field in its context, the appropriate response is sometimes a structured call. Second, the inference stack can constrain decoding to keep the output well-formed: in strict modes, once the model commits to a tool call, the decoder restricts each next token to legal continuations under the JSON Schema. Third, the provider parses the model's raw output into the structured call fields before the response reaches your code.

Ollama and vLLM provide the third step, and partially the second, but neither owns the first. They are running a model file that someone else trained. If the model was post-trained for tool use, like Llama 3.1+ or Qwen 2.5, you get reasonable results. If the model was not, the OpenAI-compatible endpoint falls back to prompt templating: it injects the tool definitions into a system prompt that asks the model to please respond in a specific JSON shape when it wants to call a function. The reliability of that approach is exactly the reliability of any other instruction-following task on a base model. Which is to say, variable.

The failure modes are consistent enough to enumerate. The model produces JSON that is almost valid but missing a closing brace. The model wraps the JSON in Markdown code fences and your parser does not strip them. The model invents an argument the schema did not declare, because it has seen similar tools in its training data. The model produces a textual description of what it would do instead of producing the call. The model emits the call in the wrong shape entirely, with the function name as a string in a content field rather than as a structured object.
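Several of these failure modes can be caught with a defensive parser in front of the tool. The sketch below is one possible approach, not a feature of Ollama or vLLM: parse_tool_args is a hypothetical helper that strips Markdown fences, rejects malformed JSON, and refuses arguments the schema did not declare. It checks only property names and required keys, not value types.

```python
import json
import re

# Build the fence marker programmatically so this listing contains no literal fence.
_MARK = "`" * 3
_FENCE = re.compile("^" + _MARK + r"(?:json)?\s*(.*?)\s*" + _MARK + "$", re.DOTALL)

def parse_tool_args(raw: str, schema: dict) -> dict:
    """Parse model-emitted arguments, raising ValueError on anything suspect."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1)          # drop surrounding Markdown code fences
    args = json.loads(text)            # JSONDecodeError (a ValueError) on a missing brace
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    extra = set(args) - set(schema.get("properties", {}))
    if extra:
        raise ValueError(f"undeclared arguments: {sorted(extra)}")
    missing = [k for k in schema.get("required", []) if k not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return args
```

The last two failure modes in the list (a prose description instead of a call, or a call in the wrong shape) surface here as a JSON parse error or a non-object value, which is a reasonable signal to retry or fall back.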

Tools like Outlines, llama.cpp grammar files, and vLLM's guided decoding can constrain the output at the token level using a JSON Schema or a CFG grammar. This raises reliability significantly for any model, even ones that were not specifically trained for tool use. The catch is that constrained decoding fixes the syntax but cannot fix the semantics: a model that does not understand when to call the tool will still call it at the wrong time, just with valid JSON.7

The practical guidance: develop against open-weights models for cost and iteration speed, but run your evaluation suite against the provider-native model that will serve production. Functional behavior should be similar enough that you can iterate locally, but never assume parity. That last 5% of reliability is exactly where the failures your users actually see tend to live.8

. . .

Wrapper Libraries: LiteLLM, OpenAI Agents SDK, LangChain

Three categories of wrapper try to hide everything you just read, and each succeeds at different parts of the job.

LiteLLM is the lowest-level wrapper. It accepts the OpenAI request shape and translates it to whatever provider you target, returning the OpenAI response shape regardless of where the call actually went. For tool use, this means you can call litellm.completion(...) with OpenAI-style tools and get back OpenAI-style tool_calls, even when Anthropic or Bedrock served the request. What leaks through: provider-specific error codes, rate-limit shapes, and any feature that has no OpenAI equivalent. If you need Anthropic's prompt caching or Gemini's grounding, LiteLLM either passes the parameter through (where it survives) or silently drops it.9

OpenAI Agents SDK is opinionated higher up the stack. It assumes you are building an agent loop, gives you primitives for tool definition, handoffs between agents, and tracing. For function calling specifically, it normalizes the OpenAI shape and supports a small set of additional providers through compatibility shims. What leaks through: anything that does not fit the agent abstraction. If your application is not really an agent loop, the SDK gets in your way more than it helps.

LangChain is the broadest abstraction, and the one most likely to be in the codebase you inherit. Its tool-calling primitives wrap each provider's native shape and present a unified BaseTool interface. What leaks through: parsing failures, where LangChain swallowed the raw error and surfaced something less informative; provider-specific quirks that the unified interface flattened away; and version drift, because the LangChain wrapper for any given provider can lag the provider's API by weeks or months.

The honest summary is that wrapper libraries reduce the per-provider boilerplate by roughly 80%, and concentrate the remaining 20% into the moments when something fails. If your code never fails, the wrappers are pure win. If it fails sometimes, you will eventually need to know what the underlying provider returned, which is what this article exists to document.

. . .

For Practitioners

Implement against the native SDK first. Even if you plan to use a wrapper, write a working tool-calling loop against one provider's raw API at least once. The four moves are simple, and seeing them in their native shape gives you the mental model you need to debug the wrapper later.

Pick a wrapper based on what you are willing to lose. LiteLLM gives up provider-specific features in exchange for a uniform call shape. LangChain gives up clarity in exchange for breadth. The OpenAI Agents SDK gives up flexibility in exchange for an opinionated agent loop. The right choice depends on which of those losses matters least for your application.

Test on the model that will run in production. A prompt and a tool definition that work perfectly with GPT-4o may behave differently on Claude Sonnet, Gemini, or Llama 3.1. Behavior, not just wire format, varies. Run your evaluation suite against every model you might deploy, with the same tool definitions and the same test cases. Do this with promptfoo or a similar framework so you can compare outputs side by side.
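The shape of such a comparison is simple even before a framework is involved. The sketch below is a toy stand-in for a real evaluation harness: compare_models, CASES, and the stub models are all hypothetical, and each "model" is just a function from prompt to emitted tool arguments.

```python
from typing import Callable

# Each case pairs a prompt with the tool arguments a correct model should emit.
CASES = [
    ("What does the Guide say about towels?", {"topic": "towel"}),
    ("Look up Vogon poetry.", {"topic": "Vogon poetry"}),
]

def compare_models(models: dict[str, Callable[[str], dict]],
                   cases: list[tuple[str, dict]]) -> dict[str, float]:
    """Return, per model, the fraction of cases whose emitted arguments match."""
    scores = {}
    for name, model in models.items():
        passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
        scores[name] = passed / len(cases)
    return scores

# Stub models standing in for real provider calls.
stubs = {
    "model_a": lambda p: {"topic": "towel"} if "towel" in p else {"topic": "Vogon poetry"},
    "model_b": lambda p: {"topic": "towel"},  # always guesses the same topic
}
print(compare_models(stubs, CASES))  # prints: {'model_a': 1.0, 'model_b': 0.5}
```

Exact-match scoring is the crudest possible metric; it is used here only to keep the example short.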

Treat open-weights tool use as a development tier, not a production tier, until proven otherwise. Iterate locally with Ollama for the cost savings. Promote to a hosted provider for the reliability. If you do want to run open-weights in production, plan for the additional engineering investment in constrained decoding, validation, and retry logic that the hosted providers handle for you.
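The retry part of that investment can be sketched in a few lines. This is one possible pattern under stated assumptions: ask_model is a hypothetical callable that takes the previous validation error (or None) and returns the raw argument string from the model, and validation here checks only for required keys.

```python
import json
from typing import Callable, Optional

def call_with_retries(ask_model: Callable[[Optional[str]], str],
                      required: list[str], max_attempts: int = 3) -> dict:
    """Ask for tool arguments, feeding each validation error back to the model."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        raw = ask_model(error)              # error is None on the first attempt
        try:
            args = json.loads(raw)          # JSONDecodeError is a ValueError
            if not isinstance(args, dict):
                raise ValueError("arguments must be a JSON object")
            missing = [k for k in required if k not in args]
            if missing:
                raise ValueError(f"missing required arguments: {missing}")
            return args
        except ValueError as exc:
            error = str(exc)                # passed to the model on the next attempt
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts: {error}")

# Stub model: emits a truncated object first, then a valid one.
attempts = iter(['{"topic": "towel"', '{"topic": "towel"}'])
print(call_with_retries(lambda err: next(attempts), ["topic"]))  # prints: {'topic': 'towel'}
```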

Log the raw provider response on every tool call. Wrapper logs are normalized. Provider logs are ground truth. When something is failing in production at 3am, the difference between a normalized log that says "tool call failed" and a raw response that shows the model emitted a malformed argument is the difference between a five-minute fix and a five-hour debugging session.
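A minimal sketch of that habit using only the standard library. The logger name tool_calls and the helper log_raw_response are hypothetical; the point is that the untouched provider payload is serialized before any wrapper normalizes it. Without a configured handler, INFO records are not printed, so wire the logger into your existing logging setup.

```python
import json
import logging

logger = logging.getLogger("tool_calls")

def log_raw_response(provider: str, raw: dict) -> str:
    """Serialize the provider's response verbatim and emit it as one log line."""
    # default=str keeps logging from failing on non-JSON values such as datetimes.
    line = json.dumps({"provider": provider, "raw": raw}, sort_keys=True, default=str)
    logger.info(line)
    return line
```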

. . .

References

Provider documentation, the Berkeley benchmark, and wrapper-library notes for each numbered reference in this article live on the companion sources page.

  1. OpenAI. (2024). "Function Calling." OpenAI Platform documentation.
  2. OpenAI. (2024). "Structured Outputs." OpenAI Platform documentation.
  3. Anthropic. (2024). "Tool Use with Claude." Anthropic API documentation.
  4. Google. (2024). "Function Calling with the Gemini API." Google AI for Developers.
  5. Amazon Web Services. (2024). "Use the Converse API with Tool Use." Bedrock User Guide.
  6. Ollama. (2024). "OpenAI Compatibility." Ollama documentation.
  7. vLLM Project. (2024). "Tool Calling." vLLM documentation.
  8. Berkeley AI Research. (2024). "Berkeley Function Calling Leaderboard." Comparative evaluation of provider-native and open-weights tool use.
  9. BerriAI. (2024). "Function Calling with LiteLLM." LiteLLM documentation.