LLM Function Calling Done Right: From Raw Prompts to Production Tool Use
The Bridge Between Chat and Action
Every useful LLM application eventually needs to do something — not just say something. Query a database. Convert a currency. Call an API. Book a flight. The mechanism that makes this possible is function calling (also called tool use): you describe functions the model can invoke, and the model responds with structured calls to those functions instead of plain text.
Function calling is the atomic unit of LLM agency. The agent pattern is just function calling in a loop. Structured output uses the same schema machinery. Guardrails need to validate tool calls before execution. Every post in this applied series connects back to what we're building today.
In this post, we'll build a personal finance assistant that queries transaction data, converts currencies, and does calculations — all orchestrated through function calling. We'll compare OpenAI and Anthropic's APIs side by side, handle the edge cases that break production systems, and see exactly what happens on the wire at every step.
The Fragile Era: Parsing Tool Calls from Raw Text
Before native function calling APIs existed — and they only arrived in mid-2023 — developers had to get creative. You described tools in the system prompt using XML, JSON, or plain English, then prayed the model would respond in a parseable format.
import re, json
SYSTEM_PROMPT = """You have access to these tools:
<tools>
<tool name="query_transactions">
<description>Search transaction history</description>
<params>
<param name="category" type="string"/>
<param name="month" type="string">YYYY-MM format</param>
</params>
</tool>
</tools>
When you need a tool, respond ONLY with:
<tool_call name="...">{"key": "value"}</tool_call>
"""
def parse_tool_call(text):
    """Extract tool call from raw model output. Fragile!"""
    match = re.search(
        r'<tool_call name="(\w+)">(.*?)</tool_call>',
        text, re.DOTALL
    )
    if not match:
        return None  # Model didn't follow the format
    try:
        return match.group(1), json.loads(match.group(2).strip())
    except json.JSONDecodeError:
        return None  # Invalid JSON — trailing commas, single quotes...
# What goes wrong:
# 1. Model invents tools: <tool_call name="send_money">...
# 2. Malformed XML: missing closing tags, extra whitespace
# 3. Broken JSON: {category: 'food'} instead of {"category": "food"}
# 4. Unnecessary calls: "Sure! Let me check. <tool_call..."
Here's the uncomfortable truth: this approach works about 80% of the time with GPT-4 class models. The problem is the other 20%, which fails spectacularly — hallucinated function names, malformed JSON, tool calls embedded inside conversational fluff. And it's worth knowing this pattern because many open-source models (Llama 3.1, Mistral) still use variations of it under the hood when served without an OpenAI-compatible API layer.
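To make those failure modes concrete, here is the parser exercised against one well-formed and two malformed responses (`parse_tool_call` is repeated from the snippet above so this runs standalone):

```python
import re, json

def parse_tool_call(text):
    match = re.search(r'<tool_call name="(\w+)">(.*?)</tool_call>', text, re.DOTALL)
    if not match:
        return None
    try:
        return match.group(1), json.loads(match.group(2).strip())
    except json.JSONDecodeError:
        return None

# Well-formed, even wrapped in conversational fluff: parses cleanly
ok = parse_tool_call('Sure! <tool_call name="query_transactions">{"category": "food"}</tool_call>')

# Broken JSON (single quotes, unquoted key): regex matches, json.loads fails -> None
bad_json = parse_tool_call("<tool_call name=\"query_transactions\">{category: 'food'}</tool_call>")

# Missing closing tag: regex never matches -> None
no_close = parse_tool_call('<tool_call name="query_transactions">{"category": "food"}')
```

Note that the fluff-wrapped case succeeds at the parsing layer; the failure there is semantic (an unnecessary or premature call), which no regex can catch.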
Native Function Calling: OpenAI vs Anthropic
Both OpenAI and Anthropic now offer native function calling, but their implementations differ in subtle ways that matter when you're writing production code. Let's define the same three tools for our finance assistant on both platforms.
Defining Tools
OpenAI
OpenAI wraps each tool in a type: "function" envelope. The parameter schema follows JSON Schema. Setting strict: true enables constrained decoding, which guarantees the model's output matches your schema exactly.
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "query_transactions",
        "description": "Search transaction history by category, month, or account",
        "parameters": {
          "type": "object",
          "properties": {
            "category": {"type": "string", "description": "e.g. groceries, restaurants"},
            "month": {"type": "string", "description": "YYYY-MM format"},
            "account": {"type": "string", "enum": ["checking", "savings", "all"]}
          },
          "required": []
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "convert_currency",
        "description": "Convert an amount between currencies",
        "parameters": {
          "type": "object",
          "properties": {
            "amount": {"type": "number"},
            "from_currency": {"type": "string", "description": "3-letter code like USD"},
            "to_currency": {"type": "string", "description": "3-letter code like EUR"}
          },
          "required": ["amount", "from_currency", "to_currency"]
        }
      }
    }
  ]
}
Anthropic
Anthropic uses a flatter structure — no function wrapper. The schema field is called input_schema instead of parameters.
{
  "tools": [
    {
      "name": "query_transactions",
      "description": "Search transaction history by category, month, or account",
      "input_schema": {
        "type": "object",
        "properties": {
          "category": {"type": "string", "description": "e.g. groceries, restaurants"},
          "month": {"type": "string", "description": "YYYY-MM format"},
          "account": {"type": "string", "enum": ["checking", "savings", "all"]}
        },
        "required": []
      }
    },
    {
      "name": "convert_currency",
      "description": "Convert an amount between currencies",
      "input_schema": {
        "type": "object",
        "properties": {
          "amount": {"type": "number"},
          "from_currency": {"type": "string", "description": "3-letter code like USD"},
          "to_currency": {"type": "string", "description": "3-letter code like EUR"}
        },
        "required": ["amount", "from_currency", "to_currency"]
      }
    }
  ]
}
The Three Differences That Bite You
The schemas look similar, but the responses diverge in ways that will cause subtle bugs if you're not careful:
The biggest gotcha when switching between providers: OpenAI returns function arguments as a JSON string you must parse. Anthropic returns input as a parsed object. Get this wrong and you'll spend an hour debugging a TypeError.
Difference 1: Where tool calls live. OpenAI puts them in a dedicated tool_calls array on the message. Anthropic mixes them into the content array alongside text blocks — a single response can contain both explanatory text and tool calls.
Difference 2: Arguments format. OpenAI's function.arguments is a JSON-encoded string — you must call json.loads(). Anthropic's input is already a parsed dictionary. This is the #1 source of cross-provider bugs.
Difference 3: How you return results. OpenAI uses a dedicated role: "tool" message. Anthropic embeds tool_result content blocks inside a role: "user" message. Sending a role: "tool" message to Anthropic's API will fail.
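A minimal side-by-side of Difference 2, using illustrative response payloads trimmed to just the relevant fields:

```python
import json

# Illustrative response shapes (trimmed); field names follow each provider's API.
openai_message = {
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "convert_currency",
            # OpenAI: arguments arrive as a JSON-encoded STRING
            "arguments": '{"amount": 100, "from_currency": "USD", "to_currency": "EUR"}',
        },
    }]
}
anthropic_content = [
    {"type": "text", "text": "Let me convert that for you."},
    {"type": "tool_use", "id": "toolu_abc123", "name": "convert_currency",
     # Anthropic: input arrives as an already-parsed object
     "input": {"amount": 100, "from_currency": "USD", "to_currency": "EUR"}},
]

openai_args = json.loads(openai_message["tool_calls"][0]["function"]["arguments"])
anthropic_args = next(b for b in anthropic_content if b["type"] == "tool_use")["input"]
assert openai_args == anthropic_args  # same call, two wire formats
```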
The Execution Loop: Building the Finance Assistant
The execution loop is the beating heart of function calling. It's also surprisingly simple — the entire pattern fits in about 30 lines. Send a message, check if the model wants to call tools, execute them, return the results, repeat.
Here's the complete loop for our personal finance assistant with OpenAI:
import openai, json
# --- Our three tool implementations ---
def query_transactions(category=None, month=None, account=None):
    """Query SQLite for transaction data. (Simplified for demo.)"""
    if account and not category:
        return {"balance": 4250.00, "currency": "USD", "as_of": "2026-02-26"}
    return {"total": 847.32, "currency": "USD", "count": 23, "category": category}

def convert_currency(amount, from_currency, to_currency):
    """Convert between currencies using current rates."""
    rates = {"USD_EUR": 0.9231, "USD_GBP": 0.7891, "EUR_USD": 1.0833}
    rate = rates.get(f"{from_currency}_{to_currency}")
    if not rate:
        raise ValueError(f"Unknown currency pair: {from_currency}/{to_currency}")
    return {"converted": round(amount * rate, 2), "rate": rate}

def calculate(expression):
    """Safely evaluate a math expression."""
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        raise ValueError(f"Unsafe expression: {expression}")
    return {"result": round(eval(expression), 2)}

TOOLS = {
    "query_transactions": query_transactions,
    "convert_currency": convert_currency,
    "calculate": calculate,
}
# --- The execution loop ---
def run_agent(user_message, tools_schema, max_turns=5):
    client = openai.OpenAI()
    messages = [
        {"role": "system", "content": "You are a personal finance assistant."},
        {"role": "user", "content": user_message},
    ]
    for turn in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools_schema
        )
        msg = response.choices[0].message
        messages.append(msg)  # Always append assistant message
        if msg.tool_calls is None:
            return msg.content  # No tool calls — we have the final answer
        # Execute each tool call and return results
        for tc in msg.tool_calls:
            fn = TOOLS.get(tc.function.name)
            if fn is None:
                result = {"error": f"Unknown function: {tc.function.name}"}
            else:
                try:
                    args = json.loads(tc.function.arguments)  # STRING → dict
                    result = fn(**args)
                except Exception as e:
                    result = {"error": str(e)}
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })
    return "Reached max turns without a final answer."
For Anthropic, the loop structure is nearly identical. Here are the lines that change, marked with comments:
import anthropic, json
def run_agent_anthropic(user_message, tools_schema, max_turns=5):
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": user_message}]
    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            system="You are a personal finance assistant.",
            messages=messages, tools=tools_schema,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":  # Changed: "tool_use" not "tool_calls"
            # Extract text from content blocks
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute tool calls from content blocks
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            fn = TOOLS.get(block.name)
            if fn is None:
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "is_error": True, "content": f"Unknown function: {block.name}",
                })
                continue
            try:
                result = fn(**block.input)  # Changed: .input is already a dict
            except Exception as e:
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "is_error": True, "content": str(e),  # Changed: is_error flag
                })
                continue
            tool_results.append({
                "type": "tool_result", "tool_use_id": block.id,
                "content": json.dumps(result),
            })
        messages.append({"role": "user", "content": tool_results})  # Changed: role is "user"
    return "Reached max turns without a final answer."
Three key differences marked in the comments: the stop reason string, the pre-parsed input object, and tool results wrapped in a user message. Everything else — the loop structure, the tool registry, the error handling — is identical. This pattern connects directly to the building AI agents post, which takes this loop and adds memory, planning, and multi-agent orchestration on top.
Parallel Calls, Sequential Chains, and When Things Go Wrong
The loop above handles the basic case. But real-world tool use gets more interesting when the model calls multiple functions at once or chains them together.
Parallel Tool Calls
Both providers support parallel calls by default. When the model decides it needs multiple pieces of data that don't depend on each other, it fires all the calls in a single response. In the loop code above, this already works — we iterate over all tool calls and return all results before the next API round-trip.
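The loop executes each call in sequence even when the model issued them in parallel, which is fine for fast local tools. If your tools do real I/O, you can fan them out with a thread pool. A minimal sketch (`run_tool` and `execute_parallel` are illustrative helpers; `registry` is any name-to-callable dict like `TOOLS` above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool(name, args, registry):
    """Execute one tool call, mapping any failure to an error payload."""
    fn = registry.get(name)
    if fn is None:
        return {"error": f"Unknown function: {name}"}
    try:
        return fn(**args)
    except Exception as e:
        return {"error": str(e)}

def execute_parallel(calls, registry, max_workers=8):
    """calls: list of (name, args) pairs from one model response.
    Results come back in the same order as the calls, so pairing
    them with tool_call ids afterward stays trivial."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_tool, name, args, registry) for name, args in calls]
        return [f.result() for f in futures]
```

Threads are enough here because the bottleneck is network I/O, not CPU; swap in `asyncio.gather` if your tool implementations are already async.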
One important gotcha: OpenAI's strict: true mode (constrained decoding) is not compatible with parallel tool calls. If you need schema guarantees, set parallel_tool_calls: false. Anthropic handles this differently — add disable_parallel_tool_use: true inside the tool_choice object.
Sequential Chains
When results from one tool inform the next call, the model naturally chains them. Ask our finance assistant "How much did I spend on groceries last month in euros?" and it will:
1. Call `query_transactions({"category": "groceries", "month": "2026-01"})`
2. Receive the USD total ($847.32)
3. Call `convert_currency({"amount": 847.32, "from_currency": "USD", "to_currency": "EUR"})`
4. Combine both results into a natural answer
Each chain link requires a full API round-trip. Two tool calls in a chain means two round-trips, each adding TTFT + generation time. This is where latency budgets matter.
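A back-of-the-envelope model makes the cost concrete (the timing constants below are illustrative placeholders, not measurements; substitute your own):

```python
# Illustrative placeholder numbers — substitute your own measurements.
TTFT_S = 0.5   # time to first token, per API round-trip
GEN_S = 0.8    # generation time, per round-trip
TOOL_S = 0.05  # local tool execution time

def chain_latency(n_sequential_calls):
    # Each sequential tool call costs one model round-trip plus tool time,
    # and the final natural-language answer costs one more round-trip.
    return (n_sequential_calls + 1) * (TTFT_S + GEN_S) + n_sequential_calls * TOOL_S

# The groceries-in-euros chain: 2 sequential tool calls -> 3 round-trips ≈ 4.0 s
```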
Handling Failures
Three things go wrong in production, and your loop needs to handle all of them:
Hallucinated function names. The model sometimes invents tools that don't exist, especially with open-source models. Always validate the function name against your registry before executing. The code above already does this with TOOLS.get(tc.function.name).
Malformed arguments. Even with native APIs, the model occasionally produces arguments that fail validation — wrong types, missing required fields, extra parameters your function doesn't accept. Wrap every execution in try/except and return the error to the model. It's usually smart enough to retry with corrected arguments.
Function exceptions. Your tool itself might fail — a database timeout, an invalid currency code, a division by zero. Anthropic provides an explicit is_error: true flag on tool results so the model knows something went wrong. OpenAI relies on parsing the error from the content string. In both cases, return a clear, structured error message — not a stack trace — so the model can decide whether to retry, try a different approach, or inform the user. This connects directly to the guardrails pattern: tool call validation is a critical safety layer.
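One lightweight pre-execution check covers both hallucinated names and malformed arguments: bind the arguments against the Python signature before calling the tool. A standard-library sketch (`validate_call` is an illustrative helper; validating against your declared JSON Schema with a library like `jsonschema` is the more thorough option):

```python
import inspect

def validate_call(registry, name, args):
    """Return an error dict if the call is invalid, else None."""
    fn = registry.get(name)
    if fn is None:
        return {"error": f"Unknown function: {name}"}
    try:
        # Signature.bind raises TypeError on missing or extra parameters
        inspect.signature(fn).bind(**args)
    except TypeError as e:
        return {"error": f"Bad arguments for {name}: {e}"}
    return None
```

Returning the error dict to the model as the tool result, instead of raising, gives it a chance to retry with corrected arguments.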
Provider Comparison and Production Patterns
Here's the full comparison at a glance:
| Feature | OpenAI | Anthropic | Mistral |
|---|---|---|---|
| Schema field | `parameters` | `input_schema` | `parameters` |
| Response format | `tool_calls[]` array | `content[]` blocks | `tool_calls[]` array |
| Arguments type | JSON string | Parsed object | JSON string |
| Stop signal | `tool_calls` | `tool_use` | `tool_calls` |
| Result role | `"tool"` | `"user"` + `tool_result` | `"tool"` |
| Error signaling | Error in content | `is_error: true` | Error in content |
| Disable parallel | `parallel_tool_calls: false` | `disable_parallel_tool_use` | `parallel_tool_calls: false` |
| Strict schemas | `strict: true` | `strict: true` (beta) | N/A |
Mistral follows OpenAI's format almost exactly — a deliberate choice that makes migration easy. Open-source models like Llama 3.1 use a native XML-style format (<function=name>{args}</function>) but are typically served through OpenAI-compatible APIs via vLLM or Ollama, which translate to the standard format.
Token Cost: The Hidden Multiplier
Tool definitions are injected into the system prompt on every API call. Both providers add ~300-530 tokens of internal overhead just for enabling tool use, plus each tool definition costs ~100-200 tokens depending on the description and schema complexity. With 10 tools, that's 1,500-2,500 extra input tokens on every single turn.
The conversation also grows faster with tool use. Each tool-call round-trip adds the assistant message (with tool call data), plus the tool result message. A 3-turn conversation with 2 tool calls per turn can easily hit 3,000-4,000 tokens of accumulated context.
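To get a feel for the overhead in your own payloads, serialize the tools array and apply the rough four-characters-per-token heuristic (a crude estimate only; use your provider's tokenizer or the usage stats in API responses for real numbers):

```python
import json

def estimate_tool_tokens(tools):
    """Very rough token estimate for a tools array (~4 chars/token)."""
    return len(json.dumps(tools)) // 4

# Ten modest hypothetical tool definitions, each with a ~150-char description
tools = [{"name": f"tool_{i}", "description": "x" * 150,
          "input_schema": {"type": "object", "properties": {}}}
         for i in range(10)]
# Already on the order of several hundred input tokens on every single call
```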
Mitigation strategies:
- Prune tools per turn — only send the tools relevant to the current state
- Use prompt caching — Anthropic's `cache_control` and OpenAI's automatic prefix caching reduce costs on repeated tool definitions
- Prefer parallel over sequential — one round-trip with 3 parallel calls is cheaper than 3 sequential round-trips
- Cap iterations — set `max_turns=5` (or similar) to prevent runaway costs
For detailed latency numbers across providers and concurrency levels, see the LLM API latency benchmarks post. For evaluating whether your agent picks the right tools, see evaluating LLM systems. And for streaming responses that include tool calls — the hardest streaming pattern — see streaming LLM responses.
Try It: Function Calling Playground
Step through three real scenarios to see exactly what happens at each stage of the function calling loop. Watch the JSON payloads, function executions, and results flow between the model and your tools.
Wrapping Up
Function calling is the mechanism that turns a language model from something that generates text into something that takes action. The core pattern is simple — a loop of send, execute, return — but the production details matter: validating function names, parsing arguments correctly per provider, handling errors gracefully, and managing token costs as conversations grow.
Four things to remember:
- Validate every tool call before executing — check the function name against your registry and catch argument errors
- Know your provider's format — OpenAI's arguments are strings, Anthropic's are objects, and getting this wrong is the #1 cross-provider bug
- Cap your loop iterations — a `max_turns` guard prevents runaway costs when the model gets stuck in a tool-calling loop
- Prefer parallel calls — they save both tokens and latency compared to sequential chains
Now that you understand function calling, you're ready to build full agents on top of this foundation. The patterns here — tool registries, execution loops, error handling — are the same ones you'll use whether you're building a customer support bot, a data analysis pipeline, or an autonomous coding assistant.
References & Further Reading
- OpenAI — Function Calling Guide — Official docs for the Chat Completions tools parameter and execution patterns.
- Anthropic — Tool Use Overview — Complete reference for Claude's tool_use content blocks and input_schema format.
- Mistral AI — Function Calling — Mistral's OpenAI-compatible tool use implementation.
- Meta — Llama 3.1 Model Card — Native tool calling format for open-source Llama models.
- JSON Schema — The specification underlying all function calling parameter schemas.
- DadOps — Building AI Agents with Tool Use — Takes the execution loop from this post and adds memory, planning, and multi-agent patterns.
- DadOps — Structured Output from LLMs — Tool call arguments are structured output — the same JSON Schema principles apply.
- DadOps — Guardrails for LLM Applications — Tool call validation as a defense-in-depth guardrail layer.
- DadOps — Streaming LLM Responses — The hardest streaming pattern: interleaved text and tool calls.
- DadOps — Evaluating LLM Systems — How to measure tool selection accuracy and argument correctness.