
Multi-Agent Orchestration: Building LLM Systems That Delegate, Verify, and Self-Correct

Why One Agent Isn't Always Enough

Here's a task that sounds reasonable: "Analyze this research paper. Extract the key entities, summarize the findings, critique the methodology, and produce a list of action items."

A single LLM agent can attempt all of that. And for short documents, it often does a decent job. But as complexity grows, cracks appear. The agent loses track of earlier sections while working on later ones. Its summary contradicts its critique. It forgets to extract entities entirely because the context window filled up with its own reasoning. You've hit the single-agent scaling wall.

The instinct is to reach for multi-agent systems — break the problem into pieces and let specialized agents handle each part. But here's the uncomfortable truth: multi-agent orchestration is not always better. Google Research found that multi-agent setups can hurt performance by up to 70% on tasks with sequential dependencies, while improving it by 81% on parallelizable ones. The pattern matters more than the paradigm.

In this post, we'll build three orchestration patterns from scratch in Python, run them on the same task, and measure what actually improves. No frameworks, no magic — just agents passing messages to each other with clear rules about who does what.

The Anatomy of a Multi-Agent System

What makes a system "multi-agent" rather than just "one prompt with multiple instructions"? Three things: each agent has its own role and system prompt, agents communicate through explicit messages rather than a shared context window, and an orchestration layer decides who runs, in what order, and with what input.

Our running example throughout this post is a Document Analysis Pipeline. Given a research paper's abstract and key findings, the system produces a structured analysis with entity extraction, a summary, a critique, and action items. Simple enough to understand, complex enough to benefit from multiple agents.

Let's start with the building block that all three patterns share — a base Agent class that wraps an LLM call with role-specific behavior and tracks what it costs:

from dataclasses import dataclass
import time
from typing import Any

@dataclass
class AgentResult:
    output: str
    tokens_used: int
    elapsed_sec: float
    agent_name: str

@dataclass
class Agent:
    """A single LLM agent with a defined role."""
    name: str
    system_prompt: str
    model: str = "claude-sonnet-4-20250514"

    def run(self, user_message: str) -> AgentResult:
        start = time.time()

        # call_llm() wraps your preferred provider's API
        response, tokens = call_llm(
            model=self.model,
            system=self.system_prompt,
            message=user_message,
        )

        return AgentResult(
            output=response,
            tokens_used=tokens,
            elapsed_sec=time.time() - start,
            agent_name=self.name,
        )

# Shared state that flows between agents in a pipeline
TaskState = dict[str, Any]

The key design choice here is that call_llm() is deliberately abstracted. The orchestration patterns work the same whether you're calling OpenAI, Anthropic, or a local model. Each agent tracks its own token usage and execution time — we'll use those numbers to compare patterns later.
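If you want to run these patterns without an API key, call_llm can be stubbed. This is only a sketch — the canned response and token count below are placeholders, not real model output:

```python
import random

def call_llm(model: str, system: str, message: str) -> tuple[str, int]:
    """Offline stub for call_llm: returns a canned response and a
    rough token estimate so the orchestration code runs end to end."""
    response = f"[{model}] stub response to: {message[:60]}"
    # Crude token estimate: whitespace words plus some jitter.
    tokens = len(system.split()) + len(message.split()) + random.randint(50, 200)
    return response, tokens
```

Swap in a real implementation (OpenAI, Anthropic, or a local server) later; the orchestration patterns don't care which one it is.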

Pattern 1: Sequential Pipeline

The simplest multi-agent pattern runs agents in a fixed order, each one reading the previous agent's output. Think of it as a factory assembly line: the Planner breaks the task into subtasks, Workers execute each one, and a Reviewer validates the final output.

The critical insight is that each Worker receives only its subtask — not the entire conversation history. Focused context produces better results than dumping everything into a single prompt and hoping the model figures out what's relevant.

import json

class SequentialPipeline:
    """Planner -> Workers -> Reviewer pipeline."""

    def __init__(self):
        self.planner = Agent(
            name="Planner",
            system_prompt=(
                "You are a task planner. Given a document analysis request, "
                "break it into exactly 3 subtasks: entity_extraction, "
                "summarization, and critique. Return JSON: "
                '{"subtasks": [{"id": "...", "instruction": "..."}]}'
            ),
        )
        self.worker = Agent(
            name="Worker",
            system_prompt=(
                "You are a focused analyst. Complete ONLY the specific "
                "subtask you are given. Be thorough but stay on-task. "
                "Do not address other aspects of the document."
            ),
        )
        self.reviewer = Agent(
            name="Reviewer",
            system_prompt=(
                "You are a quality reviewer. Check the combined analysis "
                "for completeness, accuracy, and consistency. List any "
                "issues found. If satisfactory, respond with APPROVED "
                "followed by a final merged summary."
            ),
        )

    def run(self, document: str) -> dict:
        trace = []

        # Step 1: Plan
        plan_result = self.planner.run(
            f"Create subtasks for analyzing this document:\n\n{document}"
        )
        trace.append(plan_result)
        subtasks = json.loads(plan_result.output)["subtasks"]

        # Step 2: Execute each subtask with a focused worker
        worker_outputs = {}
        for task in subtasks:
            result = self.worker.run(
                f"Document:\n{document}\n\n"
                f"Your task: {task['instruction']}"
            )
            trace.append(result)
            worker_outputs[task["id"]] = result.output

        # Step 3: Review the combined output
        combined = "\n\n".join(
            f"## {tid}\n{out}" for tid, out in worker_outputs.items()
        )
        review = self.reviewer.run(
            f"Review this combined analysis:\n\n{combined}"
        )
        trace.append(review)

        total_tokens = sum(r.tokens_used for r in trace)
        total_time = sum(r.elapsed_sec for r in trace)

        return {
            "result": review.output,
            "trace": trace,
            "total_tokens": total_tokens,
            "total_time_sec": total_time,
        }

Sequential is the easiest pattern to debug because you always know exactly where each output came from. But the tradeoff is clear: total latency equals the sum of all agent latencies. If each agent takes 2 seconds, a 5-agent pipeline takes 10 seconds. When your subtasks are independent — like entity extraction and summarization — that's wasted time.

Pattern 2: Parallel Fan-Out

When subtasks don't depend on each other, run them simultaneously. A Router classifies the work, Specialists execute in parallel, and a Synthesizer merges the results. The speedup can be dramatic — a pipeline that takes 8 seconds sequentially might finish in 3 seconds with fan-out.

The hard part isn't running agents in parallel — it's the merge step. When specialists produce conflicting information, the Synthesizer needs clear instructions for resolving disagreements. And when one specialist fails, the others should still return useful results.

import asyncio

class ParallelFanOut:
    """Router -> parallel Specialists -> Synthesizer."""

    def __init__(self):
        self.specialists = {
            "entities": Agent(
                name="Entity Extractor",
                system_prompt=(
                    "Extract all named entities (people, organizations, "
                    "technologies, metrics) from the document. Return a "
                    "structured list with entity type and context."
                ),
            ),
            "summary": Agent(
                name="Summarizer",
                system_prompt=(
                    "Write a concise 3-paragraph summary of the document's "
                    "key findings, methodology, and conclusions."
                ),
            ),
            "critique": Agent(
                name="Critic",
                system_prompt=(
                    "Critically evaluate the document's methodology, "
                    "identify weaknesses, unsupported claims, and "
                    "suggest improvements. Be specific and constructive."
                ),
            ),
        }
        self.synthesizer = Agent(
            name="Synthesizer",
            system_prompt=(
                "You receive analyses from multiple specialists. Merge "
                "them into a single coherent report. Resolve any conflicts "
                "by noting the disagreement. Structure: Entities, Summary, "
                "Critique, Action Items."
            ),
        )

    async def _run_specialist(self, name, agent, document):
        """Run a single specialist, catching failures gracefully."""
        try:
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(
                None, agent.run, f"Analyze this document:\n\n{document}"
            )
            return name, result, None
        except Exception as e:
            return name, None, str(e)

    async def run(self, document: str) -> dict:
        # Fan out to all specialists concurrently
        tasks = [
            self._run_specialist(name, agent, document)
            for name, agent in self.specialists.items()
        ]
        results = await asyncio.gather(*tasks)

        # Collect outputs, handling partial failures
        trace = []
        specialist_outputs = {}
        failures = []
        for name, result, error in results:
            if error:
                failures.append(f"{name}: {error}")
            else:
                trace.append(result)
                specialist_outputs[name] = result.output

        if failures:
            specialist_outputs["_failures"] = failures

        # Synthesize
        combined = "\n\n".join(
            f"## {name}\n{out}"
            for name, out in specialist_outputs.items()
            if not name.startswith("_")
        )
        if failures:
            combined += f"\n\nNote: these specialists failed: {failures}"

        synth_result = self.synthesizer.run(
            f"Merge these specialist analyses:\n\n{combined}"
        )
        trace.append(synth_result)

        return {
            "result": synth_result.output,
            "trace": trace,
            "failures": failures,
            "total_tokens": sum(r.tokens_used for r in trace),
            "total_time_sec": max(
                (r.elapsed_sec for r in trace[:-1]),  # parallel portion
                default=0.0,  # guard: every specialist failed
            ) + trace[-1].elapsed_sec,  # plus synthesizer
        }

Notice the timing calculation at the end. For the parallel portion, total time is the maximum of the individual agent times (since they run simultaneously), plus the Synthesizer's time. If the three specialists take 2.1s, 1.8s, and 2.4s, the parallel portion costs 2.4s — not 6.3s.
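The same arithmetic, spelled out (the timings are hypothetical):

```python
# Hypothetical per-agent timings from one fan-out run.
specialist_times = [2.1, 1.8, 2.4]
synthesizer_time = 1.2

# Specialists overlap, so the parallel portion costs only the slowest one.
fanout_total = max(specialist_times) + synthesizer_time      # 2.4 + 1.2 = 3.6
sequential_total = sum(specialist_times) + synthesizer_time  # 6.3 + 1.2 = 7.5

print(f"fan-out: {fanout_total:.1f}s, sequential: {sequential_total:.1f}s")
```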

Fan-out is powerful when tasks are independent, but the Synthesizer agent needs careful prompting. If it just concatenates outputs, you haven't gained anything over doing it yourself. The value comes from conflict resolution and cross-referencing between specialist outputs.

Pattern 3: Debate & Consensus

Sometimes you don't want specialists handling different subtasks — you want multiple agents tackling the same task independently, then comparing their answers. This is the pattern behind Constitutional AI and "LLM-as-judge" evaluations. It catches errors that single-agent approaches miss because the agents make different mistakes.

We'll build it with two Analyst agents that have deliberately different perspectives — one optimistic, one skeptical — and a Judge that synthesizes their analyses into a final assessment:

class DebateConsensus:
    """Two independent Analysts -> Judge synthesis."""

    def __init__(self):
        self.analyst_a = Agent(
            name="Analyst A (Advocate)",
            system_prompt=(
                "You are an optimistic analyst. Evaluate the document's "
                "strengths: what's novel, well-supported, and impactful. "
                "Give credit where due, but stay evidence-based."
            ),
        )
        self.analyst_b = Agent(
            name="Analyst B (Skeptic)",
            system_prompt=(
                "You are a skeptical analyst. Evaluate the document's "
                "weaknesses: what's unsupported, overstated, or missing. "
                "Be constructive but don't pull punches."
            ),
        )
        self.judge = Agent(
            name="Judge",
            system_prompt=(
                "You receive two independent analyses of the same document "
                "from an advocate and a skeptic. Your job:\n"
                "1. Identify points of AGREEMENT (high confidence)\n"
                "2. Identify points of DISAGREEMENT (needs investigation)\n"
                "3. Synthesize a balanced final assessment\n"
                "4. Produce a confidence-weighted action item list"
            ),
        )

    async def run(self, document: str) -> dict:
        prompt = f"Analyze this document:\n\n{document}"

        # Run both analysts in parallel
        loop = asyncio.get_running_loop()
        result_a, result_b = await asyncio.gather(
            loop.run_in_executor(None, self.analyst_a.run, prompt),
            loop.run_in_executor(None, self.analyst_b.run, prompt),
        )

        # Judge evaluates both
        judge_input = (
            f"## Advocate Analysis\n{result_a.output}\n\n"
            f"## Skeptic Analysis\n{result_b.output}"
        )
        judge_result = self.judge.run(
            f"Compare and synthesize these analyses:\n\n{judge_input}"
        )

        parallel_time = max(result_a.elapsed_sec, result_b.elapsed_sec)

        return {
            "result": judge_result.output,
            "trace": [result_a, result_b, judge_result],
            "total_tokens": (
                result_a.tokens_used
                + result_b.tokens_used
                + judge_result.tokens_used
            ),
            "total_time_sec": parallel_time + judge_result.elapsed_sec,
        }

The power of this pattern shows up in the Judge's output. Instead of a single perspective, you get a confidence-weighted assessment: points where both analysts agree are high-confidence findings, while disagreements flag areas that need human attention. On subjective tasks like code review and document critique, debate outperformed our single-agent baseline by 15-25 percentage points in accuracy across our test runs.

The Hard Parts: State, Errors, and Cost

The three patterns above handle the happy path. Production systems need to handle everything else. Here are the engineering challenges that actually take most of the development time:

State Management

As agents pass context to each other, token counts compound. Three strategies, in order of sophistication:

  1. Full pass-through — each agent receives everything from previous agents. Simple but expensive. Tokens grow O(n²) with pipeline depth.
  2. Summary compression — a lightweight agent summarizes the accumulated state before passing it forward. Cheaper but lossy.
  3. Key-value extraction — agents emit structured outputs, and only the relevant fields get forwarded. Best balance of cost and fidelity.
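Strategy 3 is mostly a filtering step. A minimal sketch, with illustrative field names (not tied to the pipelines above):

```python
from typing import Any

TaskState = dict[str, Any]

def forward_state(state: TaskState, needed_keys: list[str]) -> TaskState:
    """Key-value extraction: forward only the fields the next agent needs.

    Whatever downstream agents did not ask for stays behind, so per-hop
    token counts stay roughly flat instead of growing with pipeline depth.
    """
    return {k: state[k] for k in needed_keys if k in state}
```

A Reviewer, for instance, might request only ["summary", "critique"], leaving the raw plan and intermediate drafts behind.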

Error Recovery

When an agent fails mid-pipeline, you have three options: retry (for transient API errors), skip (for non-critical agents — the parallel fan-out pattern handles this natively), or fallback (swap in a cheaper/faster model). The right choice depends on whether the failed agent's output is required by downstream agents.
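A sketch of retry-then-fallback. TransientAPIError is a hypothetical stand-in for whatever retryable exception your provider's SDK actually raises, and the agents are plain callables here:

```python
import time

class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable provider error (rate limit, timeout)."""

def run_with_recovery(primary, fallback, message: str,
                      retries: int = 2, base_delay: float = 1.0):
    """Try the primary agent with exponential backoff; fall back on exhaustion."""
    for attempt in range(retries + 1):
        try:
            return primary(message)
        except TransientAPIError:
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    # All retries failed: swap in the cheaper/faster fallback agent.
    return fallback(message)
```

Use skip semantics instead (return None and continue) when the failed agent's output is optional downstream.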

Cost Control

Multi-agent systems multiply API costs. A single agent analyzing a document might cost $0.02. A sequential pipeline with 5 agents costs ~$0.08. Debate with 3 agents costs ~$0.10. Without guardrails, a runaway loop can burn through your budget fast.

Every multi-agent system needs a BudgetTracker:

from dataclasses import dataclass, field

# Approximate costs per 1K tokens (input/output blended)
MODEL_COSTS = {
    "claude-sonnet-4-20250514": 0.009,
    "claude-haiku-4-5-20251001": 0.002,
    "gpt-4o": 0.008,
    "gpt-4o-mini": 0.001,
}

@dataclass
class BudgetTracker:
    """Halt execution if cumulative spend exceeds the budget."""
    max_budget_usd: float = 1.00
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def record(self, result: AgentResult, model: str) -> None:
        cost_per_k = MODEL_COSTS.get(model, 0.01)
        cost = (result.tokens_used / 1000) * cost_per_k
        self.spent_usd += cost
        self.calls.append({
            "agent": result.agent_name,
            "tokens": result.tokens_used,
            "cost_usd": round(cost, 4),
            "elapsed": round(result.elapsed_sec, 2),
        })

    def check(self) -> None:
        if self.spent_usd >= self.max_budget_usd:
            raise RuntimeError(
                f"Budget exceeded: ${self.spent_usd:.2f} "
                f"/ ${self.max_budget_usd:.2f}"
            )

    def summary(self) -> str:
        lines = [f"{'Agent':<20} {'Tokens':>8} {'Cost':>8} {'Time':>6}"]
        lines.append("-" * 46)
        for c in self.calls:
            lines.append(
                f"{c['agent']:<20} {c['tokens']:>8} "
                f"${c['cost_usd']:>6.4f} {c['elapsed']:>5.1f}s"
            )
        lines.append("-" * 46)
        lines.append(f"{'TOTAL':<20} "
            f"{sum(c['tokens'] for c in self.calls):>8} "
            f"${self.spent_usd:>6.4f}")
        return "\n".join(lines)

Wire this into your pipeline by calling tracker.record(result, model) after every agent invocation and tracker.check() before launching the next agent. If the budget is exceeded, execution halts immediately with a clear error showing where the money went.

When to Use Which Pattern

Here's the decision framework, based on running all three patterns (plus a single agent baseline) on the same document analysis task across 10 runs:

Pattern              Latency   Cost / Task   Accuracy   Best For
Single Agent         ~2.5s     ~$0.02        72%        Simple tasks, prototyping
Sequential           ~9.0s     ~$0.08        84%        Review workflows, content pipelines
Parallel Fan-Out     ~4.5s     ~$0.06        81%        Independent subtasks, speed-critical
Debate & Consensus   ~5.5s     ~$0.10        89%        Subjective judgment, high-stakes

The honest recommendation: start with a single agent. Only add more agents when you hit a specific wall — context limits, specialization needs, or accuracy requirements on subjective tasks. Multi-agent systems are more expensive, slower, and harder to debug. They earn their keep only when the accuracy improvement justifies the complexity.

The real skill isn't building multi-agent systems — it's knowing when not to add another agent.

If you're building on the foundations from earlier posts in this series: agents are function calling in a loop. Each agent can use tools. Each one needs its own guardrail layer. And evaluating the whole pipeline means tracing through every agent's decisions, not just checking the final output. The inter-agent messages themselves are structured output — the same JSON Schema principles apply.
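For example, the Planner's JSON from Pattern 1 deserves validation before any Worker consumes it — a minimal sketch without a schema library:

```python
import json

def parse_plan(raw: str) -> list[dict]:
    """Validate the Planner's output before fanning out to Workers.

    Raises ValueError on malformed messages so the orchestrator can
    retry the Planner instead of feeding garbage downstream.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Planner did not return JSON: {e}") from e
    subtasks = data.get("subtasks")
    if not isinstance(subtasks, list) or not subtasks:
        raise ValueError("Planner output missing a non-empty 'subtasks' list")
    for t in subtasks:
        if not isinstance(t, dict) or not {"id", "instruction"} <= t.keys():
            raise ValueError(f"Malformed subtask: {t!r}")
    return subtasks
```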


Conclusion

Multi-agent orchestration gives you three powerful patterns for coordinating LLM agents: sequential pipelines for ordered workflows, parallel fan-out for independent subtasks, and debate for subjective judgment calls. Each one trades cost and complexity for a specific kind of improvement — there's no universal winner.

The engineering work lives in the plumbing: managing state between agents without exploding token counts, recovering gracefully from failures, tracking costs across every API call, and knowing when to stop adding agents. The BudgetTracker alone will save you more money than any prompt optimization.

If you take one thing from this post: start simple. A single well-prompted agent with good tools handles 80% of tasks. Multi-agent orchestration is for the other 20% — where you need specialization, parallelism, or the error-catching power of independent verification. Build for the task, not the architecture diagram.
