
LLM Function Calling Done Right: From Raw Prompts to Production Tool Use

The Bridge Between Chat and Action

Every useful LLM application eventually needs to do something — not just say something. Query a database. Convert a currency. Call an API. Book a flight. The mechanism that makes this possible is function calling (also called tool use): you describe functions the model can invoke, and the model responds with structured calls to those functions instead of plain text.

Function calling is the atomic unit of LLM agency. The agent pattern is just function calling in a loop. Structured output uses the same schema machinery. Guardrails need to validate tool calls before execution. Every post in this applied series connects back to what we're building today.

In this post, we'll build a personal finance assistant that queries transaction data, converts currencies, and does calculations — all orchestrated through function calling. We'll compare OpenAI and Anthropic's APIs side by side, handle the edge cases that break production systems, and see exactly what happens on the wire at every step.

The Fragile Era: Parsing Tool Calls from Raw Text

Before native function calling APIs existed — and they only arrived in mid-2023 — developers had to get creative. You described tools in the system prompt using XML, JSON, or plain English, then prayed the model would respond in a parseable format.

import re, json

SYSTEM_PROMPT = """You have access to these tools:

<tools>
  <tool name="query_transactions">
    <description>Search transaction history</description>
    <params>
      <param name="category" type="string"/>
      <param name="month" type="string">YYYY-MM format</param>
    </params>
  </tool>
</tools>

When you need a tool, respond ONLY with:
<tool_call name="...">{"key": "value"}</tool_call>
"""

def parse_tool_call(text):
    """Extract tool call from raw model output. Fragile!"""
    match = re.search(
        r'<tool_call name="(\w+)">(.*?)</tool_call>',
        text, re.DOTALL
    )
    if not match:
        return None  # Model didn't follow the format
    try:
        return match.group(1), json.loads(match.group(2).strip())
    except json.JSONDecodeError:
        return None  # Invalid JSON — trailing commas, single quotes...

# What goes wrong:
# 1. Model invents tools: <tool_call name="send_money">...
# 2. Malformed XML: missing closing tags, extra whitespace
# 3. Broken JSON: {category: 'food'} instead of {"category": "food"}
# 4. Unnecessary calls: "Sure! Let me check. <tool_call..."

Here's the uncomfortable truth: this approach works about 80% of the time with GPT-4 class models. The problem is the other 20%, which fails spectacularly — hallucinated function names, malformed JSON, tool calls embedded inside conversational fluff. And it's worth knowing this pattern because many open-source models (Llama 3.1, Mistral) still use variations of it under the hood when served without an OpenAI-compatible API layer.

Native Function Calling: OpenAI vs Anthropic

Both OpenAI and Anthropic now offer native function calling, but their implementations differ in subtle ways that matter when you're writing production code. Let's define the same three tools for our finance assistant on both platforms.

Defining Tools

OpenAI

OpenAI wraps each tool in a type: "function" envelope. The parameter schema follows JSON Schema. Setting strict: true enables constrained decoding, which guarantees the model's output matches your schema exactly.

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "query_transactions",
        "description": "Search transaction history by category, month, or account",
        "parameters": {
          "type": "object",
          "properties": {
            "category": {"type": "string", "description": "e.g. groceries, restaurants"},
            "month": {"type": "string", "description": "YYYY-MM format"},
            "account": {"type": "string", "enum": ["checking", "savings", "all"]}
          },
          "required": []
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "convert_currency",
        "description": "Convert an amount between currencies",
        "parameters": {
          "type": "object",
          "properties": {
            "amount": {"type": "number"},
            "from_currency": {"type": "string", "description": "3-letter code like USD"},
            "to_currency": {"type": "string", "description": "3-letter code like EUR"}
          },
          "required": ["amount", "from_currency", "to_currency"]
        }
      }
    }
  ]
}
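Enabling strict mode takes more than adding "strict": true. Under OpenAI's structured-output rules, every property must appear in required and the schema must set additionalProperties: false (optional fields are expressed as a union with null instead of being omitted from required). A sketch of the convert_currency tool with strict mode on:

```json
{
  "type": "function",
  "function": {
    "name": "convert_currency",
    "description": "Convert an amount between currencies",
    "strict": true,
    "parameters": {
      "type": "object",
      "properties": {
        "amount": {"type": "number"},
        "from_currency": {"type": "string", "description": "3-letter code like USD"},
        "to_currency": {"type": "string", "description": "3-letter code like EUR"}
      },
      "required": ["amount", "from_currency", "to_currency"],
      "additionalProperties": false
    }
  }
}
```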

Anthropic

Anthropic uses a flatter structure — no function wrapper. The schema field is called input_schema instead of parameters.

{
  "tools": [
    {
      "name": "query_transactions",
      "description": "Search transaction history by category, month, or account",
      "input_schema": {
        "type": "object",
        "properties": {
          "category": {"type": "string", "description": "e.g. groceries, restaurants"},
          "month": {"type": "string", "description": "YYYY-MM format"},
          "account": {"type": "string", "enum": ["checking", "savings", "all"]}
        },
        "required": []
      }
    },
    {
      "name": "convert_currency",
      "description": "Convert an amount between currencies",
      "input_schema": {
        "type": "object",
        "properties": {
          "amount": {"type": "number"},
          "from_currency": {"type": "string", "description": "3-letter code like USD"},
          "to_currency": {"type": "string", "description": "3-letter code like EUR"}
        },
        "required": ["amount", "from_currency", "to_currency"]
      }
    }
  ]
}

The Three Differences That Bite You

The schemas look similar, but the responses diverge in ways that will cause subtle bugs if you're not careful:

The biggest gotcha when switching between providers: OpenAI returns function arguments as a JSON string you must parse. Anthropic returns input as a parsed object. Get this wrong and you'll spend an hour debugging a TypeError.

Difference 1: Where tool calls live. OpenAI puts them in a dedicated tool_calls array on the message. Anthropic mixes them into the content array alongside text blocks — a single response can contain both explanatory text and tool calls.

Difference 2: Arguments format. OpenAI's function.arguments is a JSON-encoded string — you must call json.loads(). Anthropic's input is already a parsed dictionary. This is the #1 source of cross-provider bugs.

Difference 3: How you return results. OpenAI uses a dedicated role: "tool" message. Anthropic embeds tool_result content blocks inside a role: "user" message. Sending a role: "tool" message to Anthropic's API will fail.
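Difference 2 is the easiest to neutralize in code. A minimal normalization helper, assuming you have the raw arguments value from either provider's response object:

```python
import json

def normalize_args(raw):
    """Return a dict of tool arguments regardless of provider.

    OpenAI delivers function.arguments as a JSON-encoded string;
    Anthropic delivers block.input as an already-parsed dict.
    """
    if isinstance(raw, str):
        return json.loads(raw)  # OpenAI: parse the JSON string
    return raw                  # Anthropic: already a dict

# The same logical call, in each provider's wire format:
openai_style = '{"category": "groceries", "month": "2026-01"}'
anthropic_style = {"category": "groceries", "month": "2026-01"}

assert normalize_args(openai_style) == normalize_args(anthropic_style)
```

Routing every tool call through one helper like this means the rest of your execution loop never has to know which provider produced the call.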

The Execution Loop: Building the Finance Assistant

The execution loop is the beating heart of function calling. It's also surprisingly simple — the entire pattern fits in about 30 lines. Send a message, check if the model wants to call tools, execute them, return the results, repeat.

Here's the complete loop for our personal finance assistant with OpenAI:

import openai, json

# --- Our three tool implementations ---
def query_transactions(category=None, month=None, account=None):
    """Query SQLite for transaction data. (Simplified for demo.)"""
    if account and not category:
        return {"balance": 4250.00, "currency": "USD", "as_of": "2026-02-26"}
    return {"total": 847.32, "currency": "USD", "count": 23, "category": category}

def convert_currency(amount, from_currency, to_currency):
    """Convert between currencies using current rates."""
    rates = {"USD_EUR": 0.9231, "USD_GBP": 0.7891, "EUR_USD": 1.0833}
    rate = rates.get(f"{from_currency}_{to_currency}")
    if not rate:
        raise ValueError(f"Unknown currency pair: {from_currency}/{to_currency}")
    return {"converted": round(amount * rate, 2), "rate": rate}

def calculate(expression):
    """Evaluate a basic math expression.

    Character allowlisting blocks names and attribute access, but it still
    permits pathological inputs like 9**9**9 that can hang the process.
    Fine for a demo; use a real expression parser (e.g. ast-based) in
    production.
    """
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        raise ValueError(f"Unsafe expression: {expression}")
    return {"result": round(eval(expression), 2)}

TOOLS = {
    "query_transactions": query_transactions,
    "convert_currency": convert_currency,
    "calculate": calculate,
}

# --- The execution loop ---
def run_agent(user_message, tools_schema, max_turns=5):
    client = openai.OpenAI()
    messages = [
        {"role": "system", "content": "You are a personal finance assistant."},
        {"role": "user", "content": user_message},
    ]

    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools_schema
        )
        msg = response.choices[0].message
        messages.append(msg)  # Always append assistant message

        if msg.tool_calls is None:
            return msg.content  # No tool calls — we have the final answer

        # Execute each tool call and return results
        for tc in msg.tool_calls:
            fn = TOOLS.get(tc.function.name)
            if fn is None:
                result = {"error": f"Unknown function: {tc.function.name}"}
            else:
                try:
                    args = json.loads(tc.function.arguments)  # STRING → dict
                    result = fn(**args)
                except Exception as e:
                    result = {"error": str(e)}

            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })

    return "Reached max turns without a final answer."

For Anthropic, the loop structure is nearly identical. The comments below mark the four lines that change:

import anthropic, json

def run_agent_anthropic(user_message, tools_schema, max_turns=5):
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": user_message}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            system="You are a personal finance assistant.",
            messages=messages, tools=tools_schema,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":       # Changed: "tool_use" not "tool_calls"
            # Extract text from content blocks
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute tool calls from content blocks
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            fn = TOOLS.get(block.name)
            if fn is None:
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "is_error": True, "content": f"Unknown function: {block.name}",
                })
                continue
            try:
                result = fn(**block.input)            # Changed: .input is already a dict
            except Exception as e:
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "is_error": True, "content": str(e),  # Changed: is_error flag
                })
                continue
            tool_results.append({
                "type": "tool_result", "tool_use_id": block.id,
                "content": json.dumps(result),
            })

        messages.append({"role": "user", "content": tool_results})  # Changed: role is "user"

    return "Reached max turns without a final answer."

Four key differences marked in the comments: the stop reason string, the pre-parsed input object, the explicit is_error flag on failed results, and tool results wrapped in a user message. Everything else — the loop structure, the tool registry, the error handling — is identical. This pattern connects directly to the building AI agents post, which takes this loop and adds memory, planning, and multi-agent orchestration on top.

Parallel Calls, Sequential Chains, and When Things Go Wrong

The loop above handles the basic case. But real-world tool use gets more interesting when the model calls multiple functions at once or chains them together.

Parallel Tool Calls

Both providers support parallel calls by default. When the model decides it needs multiple pieces of data that don't depend on each other, it fires all the calls in a single response. In the loop code above, this already works — we iterate over all tool calls and return all results before the next API round-trip.

One important gotcha: OpenAI's strict: true mode (constrained decoding) is not compatible with parallel tool calls. If you need schema guarantees, set parallel_tool_calls: false. Anthropic handles this differently — add disable_parallel_tool_use: true inside the tool_choice object.
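The two toggles live in different places, which is easy to get backwards. A sketch of the request shapes using the flags named above (tool lists elided; values illustrative):

```python
# OpenAI: a top-level request parameter.
openai_request = {
    "model": "gpt-4o",
    "tools": [],                      # your tool schemas go here
    "parallel_tool_calls": False,     # force at most one call per response
}

# Anthropic: nested inside the tool_choice object.
anthropic_request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "tools": [],                      # your tool schemas go here
    "tool_choice": {
        "type": "auto",
        "disable_parallel_tool_use": True,
    },
}
```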

Sequential Chains

When results from one tool inform the next call, the model naturally chains them. Ask our finance assistant "How much did I spend on groceries last month in euros?" and it will:

  1. Call query_transactions({"category": "groceries", "month": "2026-01"})
  2. Receive the USD total ($847.32)
  3. Call convert_currency({"amount": 847.32, "from_currency": "USD", "to_currency": "EUR"})
  4. Combine both results into a natural answer

Each chain link requires a full API round-trip. Two tool calls in a chain means two round-trips, each adding TTFT + generation time. This is where latency budgets matter.
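It also shows how fast context accumulates. After the two-step chain completes, the OpenAI-style messages list looks roughly like this (IDs are hypothetical, values taken from the demo tools above, arguments shown as OpenAI's JSON strings):

```python
# Six messages for one user question: system + user + two full round-trips.
messages_after_chain = [
    {"role": "system", "content": "You are a personal finance assistant."},
    {"role": "user", "content": "How much did I spend on groceries last month in euros?"},
    {"role": "assistant", "tool_calls": [            # round-trip 1
        {"id": "call_1", "type": "function", "function": {
            "name": "query_transactions",
            "arguments": '{"category": "groceries", "month": "2026-01"}'}}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"total": 847.32, "currency": "USD", "count": 23}'},
    {"role": "assistant", "tool_calls": [            # round-trip 2
        {"id": "call_2", "type": "function", "function": {
            "name": "convert_currency",
            "arguments": '{"amount": 847.32, "from_currency": "USD", "to_currency": "EUR"}'}}]},
    {"role": "tool", "tool_call_id": "call_2",
     "content": '{"converted": 782.16, "rate": 0.9231}'},
]
```

All six messages (plus the tool schemas) are re-sent on the final round-trip that produces the natural-language answer.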

Handling Failures

Three things go wrong in production, and your loop needs to handle all of them:

Hallucinated function names. The model sometimes invents tools that don't exist, especially with open-source models. Always validate the function name against your registry before executing. The code above already does this with TOOLS.get(tc.function.name).

Malformed arguments. Even with native APIs, the model occasionally produces arguments that fail validation — wrong types, missing required fields, extra parameters your function doesn't accept. Wrap every execution in try/except and return the error to the model. It's usually smart enough to retry with corrected arguments.
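One cheap defense is checking arguments against the function's signature before calling it, so a bad call produces a structured error the model can read instead of a raw TypeError. A sketch using the standard library's inspect module (safe_invoke and the inlined convert_currency stub are illustrative, not part of the code above):

```python
import inspect

def safe_invoke(fn, args):
    """Validate args against fn's signature, then call it.

    Returns the result, or an error dict the model can retry on.
    """
    sig = inspect.signature(fn)
    try:
        sig.bind(**args)  # raises TypeError on missing or extra params
    except TypeError as e:
        return {"error": f"Invalid arguments: {e}"}
    try:
        return fn(**args)
    except Exception as e:
        return {"error": str(e)}

def convert_currency(amount, from_currency, to_currency):
    # Stub with a fixed USD→EUR rate, just to exercise the validator.
    return {"converted": round(amount * 0.9231, 2)}

print(safe_invoke(convert_currency, {"amount": 10}))  # missing params: error dict
print(safe_invoke(convert_currency,
                  {"amount": 10, "from_currency": "USD", "to_currency": "EUR"}))
```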

Function exceptions. Your tool itself might fail — a database timeout, an invalid currency code, a division by zero. Anthropic provides an explicit is_error: true flag on tool results so the model knows something went wrong. OpenAI relies on parsing the error from the content string. In both cases, return a clear, structured error message — not a stack trace — so the model can decide whether to retry, try a different approach, or inform the user. This connects directly to the guardrails pattern: tool call validation is a critical safety layer.

Provider Comparison and Production Patterns

Here's the full comparison at a glance:

Feature            OpenAI                       Anthropic                    Mistral
Schema field       parameters                   input_schema                 parameters
Response format    tool_calls[] array           content[] blocks             tool_calls[] array
Arguments type     JSON string                  Parsed object                JSON string
Stop signal        tool_calls                   tool_use                     tool_calls
Result role        "tool"                       "user" + tool_result         "tool"
Error signaling    Error in content             is_error: true               Error in content
Disable parallel   parallel_tool_calls: false   disable_parallel_tool_use    parallel_tool_calls: false
Strict schemas     strict: true                 strict: true (beta)          N/A

Mistral follows OpenAI's format almost exactly — a deliberate choice that makes migration easy. Open-source models like Llama 3.1 use a native XML-style format (<function=name>{args}</function>) but are typically served through OpenAI-compatible APIs via vLLM or Ollama, which translate to the standard format.

Token Cost: The Hidden Multiplier

Tool definitions are injected into the system prompt on every API call. Both providers add ~300-530 tokens of internal overhead just for enabling tool use, plus each tool definition costs ~100-200 tokens depending on the description and schema complexity. With 10 tools, that's 1,500-2,500 extra input tokens on every single turn.

The conversation also grows faster with tool use. Each tool-call round-trip adds the assistant message (with tool call data), plus the tool result message. A 3-turn conversation with 2 tool calls per turn can easily hit 3,000-4,000 tokens of accumulated context.
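For budgeting, a rough heuristic of about four characters per token is usually close enough; exact counts require the provider's tokenizer (e.g. tiktoken for OpenAI). A sketch (the ~4 chars/token figure is a rule of thumb for English and JSON, not a guarantee):

```python
import json

def rough_tokens(obj):
    """Very rough token estimate: ~4 characters per token.

    Good enough for budgeting tool-definition overhead; use the
    provider's real tokenizer for exact counts.
    """
    text = obj if isinstance(obj, str) else json.dumps(obj)
    return max(1, len(text) // 4)

tool_schema = {
    "name": "convert_currency",
    "description": "Convert an amount between currencies",
    "input_schema": {
        "type": "object",
        "properties": {
            "amount": {"type": "number"},
            "from_currency": {"type": "string", "description": "3-letter code like USD"},
            "to_currency": {"type": "string", "description": "3-letter code like EUR"},
        },
        "required": ["amount", "from_currency", "to_currency"],
    },
}

per_tool = rough_tokens(tool_schema)
print(f"~{per_tool} tokens per tool, ~{10 * per_tool} for ten similar tools")
```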

Mitigation strategies:

  1. Keep tool descriptions terse; definitions are re-sent on every call, so every word is a recurring cost
  2. Send only the tools relevant to the current request, not your whole registry
  3. Truncate or summarize old tool results once later turns no longer need them

For detailed latency numbers across providers and concurrency levels, see the LLM API latency benchmarks post. For evaluating whether your agent picks the right tools, see evaluating LLM systems. And for streaming responses that include tool calls — the hardest streaming pattern — see streaming LLM responses.

Try It: Function Calling Playground

Step through three real scenarios to see exactly what happens at each stage of the function calling loop. Watch the JSON payloads, function executions, and results flow between the model and your tools.


Wrapping Up

Function calling is the mechanism that turns a language model from something that generates text into something that takes action. The core pattern is simple — a loop of send, execute, return — but the production details matter: validating function names, parsing arguments correctly per provider, handling errors gracefully, and managing token costs as conversations grow.

Four things to remember:

  1. Validate every tool call before executing — check the function name against your registry and catch argument errors
  2. Know your provider's format — OpenAI's arguments are strings, Anthropic's are objects, and getting this wrong is the #1 cross-provider bug
  3. Cap your loop iterations — a max_turns guard prevents runaway costs when the model gets stuck in a tool-calling loop
  4. Prefer parallel calls — they save both tokens and latency compared to sequential chains

Now that you understand function calling, you're ready to build full agents on top of this foundation. The patterns here — tool registries, execution loops, error handling — are the same ones you'll use whether you're building a customer support bot, a data analysis pipeline, or an autonomous coding assistant.

References & Further Reading