Building AI Agents with Tool Use: From Chat to Action

The One-Line Insight That Changes Everything

Every AI agent — from ChatGPT's code interpreter to Claude Code to Devin — is built on the same surprisingly simple idea:

A language model in a while loop.

That's it. An LLM that can call functions, wrapped in a loop that feeds the results back. The model decides what to do next. Your code just executes the tools and keeps the loop running. This pattern is so fundamental that once you see it, you'll recognize it everywhere: coding agents, research assistants, data analysis pipelines, customer support bots.

In this post, we'll build a working AI agent from scratch in about 80 lines of Python. No LangChain. No CrewAI. No frameworks at all. Just raw API calls and a while loop. By the end, you'll understand exactly how every agent framework works under the hood — because you'll have built the core yourself.

We'll start with the basics (teaching an LLM to call a single function), then build up to a full agent loop with multi-step reasoning, error recovery, and safety guardrails. We'll show working code for both OpenAI and Anthropic APIs, because the pattern is the same even when the API shapes differ.

Function Calling 101: Teaching an LLM to Use Tools

Before we build an agent, we need the building block: function calling. This is the mechanism that lets an LLM say "I want to call this function with these arguments" instead of just generating text.

Here's the key mental model: the model never executes anything. It just outputs a structured request — "please call get_weather with location='Tokyo'" — and your code does the actual execution. The model is the brain; your code is the hands.

Let's see the simplest possible example. We'll define a weather tool and let the model use it:

from openai import OpenAI
import json

client = OpenAI()  # uses OPENAI_API_KEY env var

# Step 1: Define a tool — this is a JSON Schema that tells the
# model what function exists and what arguments it takes
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. 'San Francisco'"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Step 2: Call the API with the tool definition
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# Step 3: The model doesn't answer — it asks us to call a function
msg = response.choices[0].message
print(msg.tool_calls[0].function.name)       # "get_weather"
print(msg.tool_calls[0].function.arguments)   # '{"location": "Tokyo"}'

The model saw our tool definition and decided that answering this question requires calling get_weather. It returned a tool call instead of a text response. Now we execute it and feed the result back:

# Step 4: Execute the tool (in real life, call a weather API)
def get_weather(location):
    # Fake implementation for demo purposes
    return {"temp": 22, "condition": "partly cloudy", "city": location}

# Parse the arguments and call our function
tool_call = msg.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = get_weather(**args)

# Step 5: Send the result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    msg,  # the assistant's response (with tool_calls)
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }
]

final = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)

print(final.choices[0].message.content)
# "The weather in Tokyo is 22°C and partly cloudy."

That's function calling in a nutshell: define a schema, let the model decide to call it, execute the function yourself, and send the result back for a final response. Two API calls, one tool execution in between.

But this is just a single tool call. The model makes one decision and we're done. To build an agent, we need to let the model make many decisions in a row — and that means adding a loop.

The API Shape: OpenAI vs. Anthropic

Before we build the loop, let's acknowledge reality: most developers work with more than one LLM provider. The function calling pattern is the same everywhere, but the API shapes differ in ways that will bite you if you don't know about them upfront.

Here's the same weather tool defined for Anthropic's Claude:

import anthropic
import json

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

# Anthropic uses "input_schema" instead of "parameters",
# and there's no wrapping "function" object
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g. 'San Francisco'"
                }
            },
            "required": ["location"]
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=messages,
    tools=tools,
)

The response structure differs in a fundamental way. OpenAI puts tool calls in a separate tool_calls field. Anthropic mixes them into the content array alongside text blocks — which means the model can explain what it's doing while calling a tool:

# Anthropic response.content is a list of mixed blocks:
# [
#   TextBlock(type='text', text="I'll check the weather in Tokyo."),
#   ToolUseBlock(type='tool_use', id='toolu_abc123',
#                name='get_weather', input={'location': 'Tokyo'})
# ]

# Feed the result back as a "user" message with tool_result blocks
# (OpenAI uses a dedicated "tool" role instead)
tool_use = next(b for b in response.content if b.type == "tool_use")

messages.append({"role": "assistant", "content": response.content})
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_use.id,  # e.g. 'toolu_abc123'
        "content": json.dumps({"temp": 22, "condition": "partly cloudy"})
    }]
})

Here's a quick reference for the key differences:

Feature               OpenAI                                  Anthropic
Schema field          parameters                              input_schema
Tool calls location   separate tool_calls field               mixed into content blocks
Arguments format      JSON string (must parse)                parsed object (ready to use)
Tool result role      role: "tool"                            role: "user" + tool_result block
Stop signal           no tool_calls on the message            stop_reason: "end_turn"
Force a tool          tool_choice: {"type": "function", ...}  tool_choice: {"type": "tool", "name": ...}

The one that catches everyone: OpenAI returns tool arguments as a JSON string that you need to json.loads(), while Anthropic gives you a ready-to-use Python dict. Forgetting this is the #1 source of "my agent works with Claude but breaks with GPT" bugs.
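One way to kill this class of bug is to normalize at the boundary. Here's a minimal shim (a sketch; `parse_tool_args` is our own name, not part of either SDK):

```python
import json

def parse_tool_args(raw):
    """Normalize tool-call arguments across providers.

    OpenAI returns arguments as a JSON string; Anthropic returns
    an already-parsed dict. Accept either and return a dict.
    """
    if isinstance(raw, str):
        return json.loads(raw)
    return raw
```

Call this on every tool call's arguments, and the rest of your agent code never needs to know which provider produced them.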

From Function Calling to Agent: Adding the Loop

Here's the leap from "function calling" to "agent." Single-turn tool use means: user asks, model calls one function, you get an answer. Two API calls, fixed flow. An agent means: user asks, and the model decides its own execution path — calling tools, reading results, calling more tools, until it has enough information to answer.

The implementation is almost anticlimactically simple:

import json
from openai import OpenAI

class Agent:
    def __init__(self, system_prompt, tools, tool_functions, max_turns=10):
        self.client = OpenAI()
        self.system_prompt = system_prompt
        self.tools = tools                     # tool schemas for the API
        self.tool_functions = tool_functions    # {"name": callable} mapping
        self.max_turns = max_turns

    def run(self, user_message):
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_message},
        ]

        for turn in range(self.max_turns):
            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=self.tools,
            )

            msg = response.choices[0].message
            messages.append(msg)

            # No tool calls? The model is done — return the answer
            if not msg.tool_calls:
                return msg.content

            # Execute every tool the model requested
            for tool_call in msg.tool_calls:
                name = tool_call.function.name
                args = json.loads(tool_call.function.arguments)

                try:
                    result = self.tool_functions[name](**args)
                except Exception as e:
                    result = {"error": str(e)}

                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result) if not isinstance(result, str) else result,
                })

        return "Agent reached maximum turns without finishing."

That's about 40 lines for a working agent. Let's walk through the loop:

  1. Call the model with the full conversation history and tool definitions.
  2. Check the response. If there are no tool calls, the model has decided it has enough information — return its text response.
  3. Execute tools. For each tool call, look up the function by name and call it. If it throws, catch the error and return it as a string.
  4. Append results to the conversation and go back to step 1.

The critical design choice is in the error handling: when a tool fails, we don't crash the loop. We return the error message to the model as a normal tool result. The model can read the error, understand what went wrong, and try a different approach. This "error as data" pattern is what separates robust agents from brittle ones.

Building a Real Agent: The Research Assistant

Let's wire up some real tools and watch the agent think. We'll build a research assistant that can search for files, read their contents, and do math. Three tools, each just a few lines:

import glob
import os

def search_files(pattern, directory="."):
    """Find files matching a glob pattern."""
    matches = glob.glob(os.path.join(directory, pattern), recursive=True)
    return {"files": matches[:20], "total": len(matches)}

def read_file(path, max_lines=50):
    """Read the first N lines of a file."""
    try:
        with open(path) as f:
            all_lines = f.readlines()
        return {
            "content": "".join(all_lines[:max_lines]),
            "total_lines": len(all_lines),
        }
    except FileNotFoundError:
        return {"error": f"File not found: {path}"}

def calculate(expression):
    """Evaluate a math expression after a strict character whitelist check."""
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        return {"error": "Invalid characters in expression"}
    try:
        # The whitelist blocks names and attribute access, so eval can
        # only see arithmetic; it may still raise on bad syntax or /0
        return {"result": eval(expression)}
    except (SyntaxError, ZeroDivisionError) as e:
        return {"error": str(e)}

Now define the tool schemas and create the agent:

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_files",
            "description": "Search for files matching a glob pattern",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {
                        "type": "string",
                        "description": "Glob pattern, e.g. '**/*.py'"
                    },
                    "directory": {
                        "type": "string",
                        "description": "Directory to search in (default: current)"
                    }
                },
                "required": ["pattern"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path to read"},
                    "max_lines": {"type": "integer", "description": "Max lines (default 50)"}
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression, e.g. '(10 + 20) / 3'"}
                },
                "required": ["expression"]
            }
        }
    }
]
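A side note before wiring up the agent: writing these schemas by hand is repetitive. A crude generator from a function signature might look like this (a sketch, and our own helper name; it types every parameter as a string and treats defaulted parameters as optional, whereas real projects map Python types to JSON Schema types or use a library):

```python
import inspect

def make_schema(fn, descriptions=None):
    """Build an OpenAI-style tool schema from a function signature.

    Crude sketch: every parameter is typed as a string, and parameters
    with defaults are treated as optional.
    """
    descriptions = descriptions or {}
    props, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        props[name] = {"type": "string",
                       "description": descriptions.get(name, "")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": (fn.__doc__ or "").strip(),
            "parameters": {"type": "object",
                           "properties": props,
                           "required": required},
        },
    }
```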

agent = Agent(
    system_prompt="You are a helpful research assistant. Use your tools to answer questions accurately.",
    tools=tools,
    tool_functions={
        "search_files": search_files,
        "read_file": read_file,
        "calculate": calculate,
    },
)

answer = agent.run("How many Python files are in this project, and what's the total line count?")

What happens under the hood? The model enters a multi-step reasoning chain that it controls entirely on its own:

  1. Turn 1: The model thinks "I need to find all Python files" and calls search_files(pattern="**/*.py").
  2. Turn 2: It gets back a list of 12 files. Now it calls read_file on each one to count lines (it may batch several calls in parallel).
  3. Turn 3: With all the line counts, it calls calculate to sum them up.
  4. Turn 4: It has everything it needs. No more tool calls — it returns a natural language answer: "There are 12 Python files with a total of 1,847 lines."

We didn't hardcode this sequence. We didn't tell the model "first search, then read, then calculate." The model figured out the plan, executed it step by step, and decided when it was done. That's what makes it an agent.


Making Agents Reliable: The Hard-Won Patterns

Our 40-line agent works, but it'll get into trouble in production. Here are the patterns that make the difference between a demo and a system you can actually depend on.

1. Error as Data, Never as Crash

We already saw this: wrap every tool execution in try/except and return the error string to the model. But take it further — return useful errors:

try:
    result = self.tool_functions[name](**args)
except KeyError:
    result = f"Unknown tool '{name}'. Available tools: {list(self.tool_functions)}"
except TypeError as e:
    result = f"Wrong arguments for {name}: {e}. Check the schema."
except Exception as e:
    result = f"Tool '{name}' failed: {type(e).__name__}: {e}"

The model reads these errors, understands what went wrong, and often fixes its approach on the next turn. This is dramatically more robust than crashing.
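The same dispatch works as a standalone helper (a sketch; `safe_call` is our own name):

```python
def safe_call(tool_functions, name, args):
    """Execute a tool by name, converting every failure into a readable
    string the model can act on instead of an exception."""
    fn = tool_functions.get(name)
    if fn is None:
        return f"Unknown tool '{name}'. Available tools: {list(tool_functions)}"
    try:
        return fn(**args)
    except TypeError as e:
        return f"Wrong arguments for {name}: {e}. Check the schema."
    except Exception as e:
        return f"Tool '{name}' failed: {type(e).__name__}: {e}"
```

One design detail: the lookup happens before the try block, so a tool that raises KeyError internally gets reported as a tool failure rather than misdiagnosed as an unknown tool.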

2. The "Reasoning" Field Trick

Add a reasoning field to your tool schemas. This forces the model to explain its thinking before it acts:

{
    "type": "function",
    "function": {
        "name": "search_files",
        "description": "Search for files matching a glob pattern",
        "parameters": {
            "type": "object",
            "properties": {
                "reasoning": {
                    "type": "string",
                    "description": "Why are you searching for these files? What do you expect to find?"
                },
                "pattern": {
                    "type": "string",
                    "description": "Glob pattern, e.g. '**/*.py'"
                }
            },
            "required": ["reasoning", "pattern"]
        }
    }
}

Now every tool call includes the model's rationale, which helps with debugging and improves the model's decision-making (it's a form of chain-of-thought prompting baked into the tool schema).
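If you adopt this across many tools, a small helper can inject the field into every schema (a sketch; `with_reasoning` is a hypothetical name, and it assumes the OpenAI schema shape shown above):

```python
import copy

def with_reasoning(tool_schema):
    """Return a copy of an OpenAI-style tool schema with a required
    'reasoning' parameter added; the original schema is untouched."""
    schema = copy.deepcopy(tool_schema)
    params = schema["function"]["parameters"]
    params["properties"]["reasoning"] = {
        "type": "string",
        "description": "Why are you calling this tool, and what do you expect to find?",
    }
    params["required"] = ["reasoning"] + params.get("required", [])
    return schema
```

Then wrap your tool list once at startup: `tools = [with_reasoning(t) for t in tools]`.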

3. Convergence Detection

Sometimes agents get stuck in a loop — calling the same tool with the same arguments repeatedly. Detect this and break out:

recent_calls = []

for turn in range(self.max_turns):
    # ... get response, check for tool calls ...

    for tool_call in msg.tool_calls:
        call_sig = (tool_call.function.name, tool_call.function.arguments)

        if recent_calls.count(call_sig) >= 2:
            return "Agent appears stuck — same tool called 3 times with identical args."

        recent_calls.append(call_sig)
        # ... execute tool ...
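The inline check above is a fragment of the loop; as a self-contained class (our own name, `LoopDetector`) it is easier to test and reuse:

```python
from collections import Counter

class LoopDetector:
    """Flags when the same tool is called with identical arguments
    too many times in a single agent run."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def is_stuck(self, name, args_json):
        # Count by (tool name, raw argument string) signature
        self.counts[(name, args_json)] += 1
        return self.counts[(name, args_json)] >= self.max_repeats
```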

4. Token Budget Awareness

Every turn adds to the conversation history. Long-running agents will hit context window limits. The simplest defense: limit tool output size.

def truncate_result(result, max_chars=2000):
    """Keep tool results from blowing up the context window."""
    text = json.dumps(result) if not isinstance(result, str) else result
    if len(text) > max_chars:
        return text[:max_chars] + f"\n... (truncated, {len(text)} chars total)"
    return text

For serious agents, you'd also track cumulative token usage and summarize older messages when approaching the limit. But truncating tool output handles 80% of context blowups with one line of code.
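Summarization takes real engineering; a cruder fallback, sketched here, is to drop the middle of the history while keeping the system prompt and the most recent turns. (Caveat: with OpenAI's API this can break a conversation if it orphans a tool result from its matching tool call, so treat it as a starting point, not a drop-in.)

```python
def trim_history(messages, max_messages=20):
    """Keep the first two messages (system prompt + original request)
    and the most recent tail; drop everything in between."""
    if len(messages) <= max_messages:
        return messages
    return messages[:2] + messages[-(max_messages - 2):]
```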

The ReAct Pattern: Giving It a Name

What we've been building has a name: ReAct, from the paper by Yao et al. (arXiv 2022, published at ICLR 2023). ReAct stands for Reason + Act, and it describes the cycle at the heart of every agent:

  1. Thought: The model reasons about what to do next.
  2. Action: The model calls a tool.
  3. Observation: The tool result comes back.
  4. Repeat until the model has enough information to answer.

With modern APIs that support native tool calling, you get ReAct "for free" — the model reasons in its text output, acts via tool calls, and observes via tool results. Our Agent class is a ReAct agent without us having to parse any text.

Before native tool calling existed, people implemented ReAct by giving the model a specially formatted prompt and parsing its text output with regex. Simon Willison's influential 2023 blog post showed the minimal version:

# The pre-tool-calling approach: parse structured text from the model
system = """You run in a loop of Thought, Action, Observation.

Use Thought to describe your reasoning.
Use Action to run one of these tools:
- search[query]
- calculate[expression]
- finish[answer]

Example:
Thought: I need to find the population of France.
Action: search[population of France]
Observation: 68 million
Thought: I have the answer.
Action: finish[68 million people]
"""

# Then parse "Action: toolname[args]" from the model's text output
import re
match = re.search(r"Action: (\w+)\[(.+?)\]", response_text)

This still works, and some developers prefer it for its simplicity and portability (it works with any LLM, even ones without native tool calling). But native tool calling is more reliable — the model outputs structured JSON instead of free text that might have formatting quirks.
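The regex dispatch wraps naturally into a small parser, sketched here with the same `Action: toolname[args]` convention as the prompt above:

```python
import re

def parse_action(text):
    """Extract (tool, argument) from ReAct-style text output,
    or None if the model produced no Action line."""
    match = re.search(r"Action: (\w+)\[(.+?)\]", text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

The None case is exactly the "formatting quirk" risk: a model that writes `Action - search(France)` instead of `Action: search[France]` silently produces no action, which is why native tool calling won out.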

The Anthropic Version: Same Pattern, Different Shapes

For completeness, here's the equivalent agent loop for Anthropic's API. Same pattern, different message structure:

import anthropic
import json

class ClaudeAgent:
    def __init__(self, system_prompt, tools, tool_functions, max_turns=10):
        self.client = anthropic.Anthropic()
        self.system_prompt = system_prompt
        self.tools = tools
        self.tool_functions = tool_functions
        self.max_turns = max_turns

    def run(self, user_message):
        messages = [{"role": "user", "content": user_message}]

        for turn in range(self.max_turns):
            response = self.client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                system=self.system_prompt,
                tools=self.tools,
                messages=messages,
            )

            # Append the full assistant response
            messages.append({"role": "assistant", "content": response.content})

            # Any stop_reason other than "tool_use" means the model is
            # done: extract the text blocks and return them
            if response.stop_reason != "tool_use":
                return "".join(
                    block.text for block in response.content
                    if hasattr(block, "text")
                )

            # Execute tools and collect results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        result = self.tool_functions[block.name](**block.input)
                    except Exception as e:
                        result = {"error": str(e)}

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                            if not isinstance(result, str) else result,
                    })

            # Tool results go in a "user" message (not a "tool" role)
            messages.append({"role": "user", "content": tool_results})

        return "Agent reached maximum turns without finishing."

The structural differences are worth highlighting: the system prompt is a top-level system parameter rather than a message, tool results travel back inside a "user" message instead of a dedicated "tool" role, and the loop exits on stop_reason rather than on the absence of tool_calls.

Different API surface, identical pattern underneath. If you can build the OpenAI version, porting to Claude (or any other provider with tool calling) is a 15-minute exercise in reading the docs.

From Toy to Production: What Comes Next

We've built a real agent — one that reasons across multiple steps, calls tools, handles errors, and decides when it's done. Production agent systems add more layers on top of this core, but the core itself doesn't change.

This isn't theoretical. The ralph-loop that built 21 games on this very site is an autonomous agent built on exactly this pattern — a bash while loop running Claude Code, with file-based tools, a task list, and a watchdog process for safety. Same idea, scaled up.

If you want frameworks, they exist: LangChain, CrewAI, AutoGen, the Agents SDK from OpenAI. But now you know what they abstract over. Under every framework is a while loop, a list of tools, and a model that decides what to do next.

The agent pattern is just an LLM in a while loop. Everything else is engineering.

References & Further Reading