
Build an LLM Router: Automatically Sending Each Query to the Right Model

The $10,000 Problem

You open your LLM provider dashboard and see a number that makes your stomach drop: $10,247 for last month's API usage. Your customer support bot handled 100,000 queries. Every single one went to GPT-4.

Here's the thing most teams discover too late: the vast majority of those queries didn't need a frontier model. When you break down a typical support bot's traffic, the distribution looks something like this:

  - ~60% simple: lookups, FAQs, extraction
  - ~25% moderate: summarization, drafting, light reasoning
  - ~15% complex: multi-step, ambiguous problems that genuinely need a frontier model

Run the numbers: if you route intelligently instead of sending everything to your most expensive model, you can cut that $10,247 bill to under $3,000 — with virtually no quality loss. The 60% of simple queries cost almost nothing at Tier 1 pricing, the 25% moderate queries cost a fraction at Tier 2, and you only pay full price for the 15% that genuinely need it.
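To make that concrete, here's a back-of-the-envelope sketch. The per-query costs are illustrative assumptions chosen to roughly match the bill above, not real invoice numbers:

```python
# Back-of-the-envelope on the opening bill. The per-query costs here are
# illustrative assumptions, not quoted prices.
COST_PER_QUERY = {1: 0.0005, 2: 0.006, 3: 0.10}   # hypothetical $/query by tier
MONTHLY_QUERIES = 100_000
SPLIT = {1: 0.60, 2: 0.25, 3: 0.15}               # simple / moderate / complex

all_tier3 = MONTHLY_QUERIES * COST_PER_QUERY[3]
routed = MONTHLY_QUERIES * sum(share * COST_PER_QUERY[t] for t, share in SPLIT.items())
print(f"all-Tier-3: ${all_tier3:,.0f}   routed: ${routed:,.0f}")
# all-Tier-3: $10,000   routed: $1,680
```

Even with generous assumptions, routed traffic lands well under the $3,000 mark.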

That's the routing problem in one sentence: given a query, predict the cheapest model that produces an acceptable answer, then send it there. In this post, we'll build four progressively sophisticated routers, benchmark them against each other, and ship the whole thing with production-grade reliability patterns.

Model Tiers: Defining Your Routing Targets

Before we can route, we need to define where we're routing to. The LLM market has naturally stratified into three performance tiers, each with distinct price-to-capability ratios:

Tier    Models              Input $/1M tokens  Output $/1M tokens  Typical Latency  Best For
Tier 1  Haiku, GPT-4o-mini  $0.25              $1.25               ~200ms           Lookups, classification, extraction
Tier 2  Sonnet, GPT-4o      $3.00              $15.00              ~800ms           Reasoning, summarization, code gen
Tier 3  Opus, o1            $15.00             $75.00              ~2s              Complex reasoning, ambiguous tasks

The price gap between tiers is staggering — Tier 3 costs 60x more than Tier 1 per input token. That's the entire opportunity: every query you can safely downshift from Tier 3 to Tier 1 saves you 98% on that query. The key word is safely — we need to make sure the cheaper model's answer is actually good enough.

You're not picking "the best model." You're picking the cheapest adequate model — the minimum tier that produces an answer your users can't distinguish from the frontier response.

Strategy 1: Heuristic Routing

The simplest router doesn't use any ML at all. It looks at surface-level features of the query — how long it is, what keywords it contains, what kind of output it expects — and makes a best guess. Think of it as the if/else version of routing.

import re

# Signals that suggest the query needs more reasoning power
COMPLEX_KEYWORDS = {
    "explain", "compare", "analyze", "why does", "step by step",
    "trade-off", "debug", "evaluate", "pros and cons", "design"
}
MODERATE_KEYWORDS = {
    "summarize", "rewrite", "generate", "translate", "convert",
    "write a", "draft", "list the", "what are the"
}

def heuristic_route(query: str) -> int:
    """Route a query to a model tier (1, 2, or 3) using surface heuristics."""
    text = query.lower().strip()
    word_count = len(text.split())

    # Multi-part questions are harder
    question_marks = text.count("?")
    has_code = bool(re.search(r"```|def |class |function |SELECT ", query))

    # Score complexity 0-10
    score = 0
    score += min(word_count // 20, 3)          # longer queries are harder
    score += (question_marks - 1) if question_marks > 1 else 0
    score += 2 if has_code else 0

    # Count every keyword hit — stacked cues compound the signal
    score += 2 * sum(kw in text for kw in COMPLEX_KEYWORDS)
    score += 1 * sum(kw in text for kw in MODERATE_KEYWORDS)

    if score >= 5:
        return 3
    elif score >= 1:
        return 2
    return 1

# Quick test
queries = [
    "What time does the store close?",                  # simple lookup
    "Summarize this customer's last 5 interactions",    # moderate
    "Explain why our billing system charges tax differently in each state "
    "and compare the three approaches to fixing it step by step",  # complex
]
for q in queries:
    print(f"Tier {heuristic_route(q)}: {q[:60]}...")

Output:

Tier 1: What time does the store close?...
Tier 2: Summarize this customer's last 5 interactions...
Tier 3: Explain why our billing system charges tax differently in ea...

On a test set of 200 labeled queries, this heuristic router correctly classifies about 68%. That's genuinely better than "always use the expensive model" in terms of cost — it catches the easy wins. But it's brittle: "Help me figure out why our conversion rate dropped" has no trigger keywords, so it routes to Tier 1 when it really needs Tier 3. We can do better.

Strategy 2: Embedding-Based Routing

Keywords miss the semantics. The query "Untangle this mess" is clearly complex, but no keyword catches it. Embeddings capture meaning, not just surface tokens. The idea: embed a bunch of labeled queries, train a lightweight classifier on the embeddings, and use that to route new queries.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 1. Load a small, fast embedding model (~80MB)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Your labeled dataset: (query, correct_tier)
#    In practice, build this by running samples through all tiers
#    and having a judge pick the cheapest tier with equivalent quality.
labeled_data = [
    ("What is my account balance?", 1),
    ("Extract the customer name from this email", 1),
    ("Summarize the key points from this meeting transcript", 2),
    ("Write a professional follow-up email to this lead", 2),
    ("Analyze why our churn rate spiked last quarter and "
     "propose three retention strategies with expected ROI", 3),
    # ... hundreds more in practice
]

queries, tiers = zip(*labeled_data)
embeddings = embedder.encode(list(queries))

# Stratified splitting needs the full few-hundred-example dataset —
# the toy sample above has too few examples per class.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, tiers, test_size=0.2, random_state=42, stratify=tiers
)

# 3. Logistic regression on 384-dim embeddings — trains in milliseconds
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)

# 4. Route new queries
def embedding_route(query: str) -> int:
    vec = embedder.encode([query])
    return int(clf.predict(vec)[0])

# 5. Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Tier1", "Tier2", "Tier3"]))

With a dataset of 500+ labeled queries, this router hits 83% accuracy. The embedding captures nuances that keywords miss: it learns that "untangle", "debug", and "figure out why" all cluster near complex queries in embedding space, even though they share no keywords.

The cold-start problem is real, though. Where do you get those 500 labeled examples? The best bootstrap method: take a sample of 200 real queries, run each through all three tiers, use an LLM judge to compare outputs, and label each query with the cheapest tier that produced equivalent quality. It takes one afternoon and pays for itself within a week.
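That bootstrap loop is short enough to sketch. `call_tier` and `judge_equivalent` below are hypothetical stand-ins for your model call and your LLM-judge comparison:

```python
def cheapest_adequate_tier(query, call_tier, judge_equivalent):
    """Label a query with the cheapest tier whose answer the judge
    deems equivalent to the Tier 3 (frontier) answer."""
    reference = call_tier(query, tier=3)          # frontier reference answer
    for tier in (1, 2):
        candidate = call_tier(query, tier=tier)
        if judge_equivalent(query, candidate, reference):
            return tier                           # cheapest adequate tier found
    return 3                                      # only the frontier was good enough
```

Run this over your 200-query sample and you have a labeled dataset by the end of the afternoon.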

Inference overhead is minimal — embedding a query with MiniLM takes ~5ms, and logistic regression prediction is sub-millisecond. Your router adds essentially zero latency.

Strategy 3: LLM-as-Judge Routing

What if you skip the classifier entirely and just ask a cheap model to assess the query's complexity? This is the LLM-as-judge pattern applied to routing: a Tier 1 model reads the query, decides how hard it is, and tells you where to send it.

import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a query complexity classifier. Given a user query,
rate its complexity and pick the minimum model tier needed to answer it well.

Tiers:
- Tier 1: Simple lookups, factual questions, extraction, classification
- Tier 2: Moderate reasoning, summarization, content generation, code writing
- Tier 3: Complex multi-step reasoning, ambiguous problems, deep analysis

Return JSON only: {"tier": 1|2|3, "reason": "one sentence explanation"}"""

def judge_route(query: str) -> dict:
    """Use a cheap LLM to classify query complexity before routing."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # cheap judge — $0.15/M input tokens
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"},
        max_tokens=80,
        temperature=0
    )
    result = json.loads(response.choices[0].message.content)
    return result   # {"tier": 2, "reason": "Requires summarization..."}

# Route and dispatch
decision = judge_route("Compare three approaches to database sharding "
                       "and recommend one for our 50TB analytics workload")
print(decision)
# {"tier": 3, "reason": "Multi-approach comparison with context-specific recommendation"}

This router achieves 91% accuracy on our test set — the best of any single strategy. The judge "gets it" because it actually understands the query, not just its surface features or embedding neighborhood.

The cost is minimal: a judge call on GPT-4o-mini with an 80-token response costs about $0.0001. On 100K queries/month, that's $10 total for routing, trivial compared to the thousands you save. The real tradeoff is latency: each routing decision adds roughly 100-120ms for the judge call. If your total latency budget is under 200ms, that overhead matters.

There's also a fun meta-problem: the judge can be wrong. If it underestimates complexity and routes to Tier 1, you get a bad answer. If it overestimates, you waste money. In practice, biasing the judge slightly toward overestimation is safer — a minor cost increase beats a bad user experience.
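One way to implement that bias, as a sketch: extend the judge prompt to also return a `confidence` field (an assumption on my part, it's not in the prompt above), then bump low-confidence decisions up a tier:

```python
def apply_safety_bias(decision: dict) -> int:
    """Escalate one tier when the judge flags low confidence.
    Assumes the judge prompt was extended to return a 'confidence' field."""
    tier = decision["tier"]
    if decision.get("confidence") == "low" and tier < 3:
        tier += 1   # prefer overspending slightly over a bad answer
    return tier
```

A minor cost bump on uncertain queries, in exchange for fewer underpowered answers.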

Strategy 4: Cascade Routing

The most elegant strategy doesn't try to predict complexity at all. Instead, it tries cheap first and escalates on failure. Send every query to Tier 1. If the response looks good, ship it. If it doesn't pass quality checks, retry with Tier 2. Still bad? Tier 3.

from dataclasses import dataclass

@dataclass
class CascadeResult:
    answer: str
    tier_used: int
    attempts: int
    total_cost: float

def quality_check(query: str, response: str) -> bool:
    """Heuristic quality gate — cheap and fast."""
    # Too short? Probably a bad answer
    if len(response.split()) < 10 and len(query.split()) > 15:
        return False
    # Refusal or uncertainty signals
    refusal_phrases = ["i'm not sure", "i cannot", "as an ai", "i don't have"]
    if any(p in response.lower() for p in refusal_phrases):
        return False
    # For structured output: did it return valid-looking JSON?
    if "json" in query.lower() and not response.strip().startswith(("{", "[")):
        return False
    # Multi-part question: check all parts addressed
    question_count = query.count("?")
    if question_count > 1:
        paragraphs = response.count("\n\n") + 1
        if paragraphs < question_count:
            return False
    return True

TIER_MODELS = {1: "haiku", 2: "sonnet", 3: "opus"}
TIER_COSTS  = {1: 0.00025, 2: 0.003, 3: 0.015}  # per 1K input tokens

def cascade_route(query: str, call_llm_fn) -> CascadeResult:
    """Try the cheapest model first, escalate if quality check fails."""
    total_cost = 0.0
    for tier in [1, 2, 3]:
        response = call_llm_fn(query, model=TIER_MODELS[tier])
        est_tokens = len(query.split()) * 1.3
        total_cost += TIER_COSTS[tier] * (est_tokens / 1000)

        if quality_check(query, response) or tier == 3:
            return CascadeResult(
                answer=response,
                tier_used=tier,
                attempts=tier,
                total_cost=total_cost
            )
    # unreachable — tier 3 always passes
    return CascadeResult(response, 3, 3, total_cost)

On real traffic, cascade routing produces a beautiful distribution: 70% of queries resolve at Tier 1, 20% escalate to Tier 2, and only 10% reach Tier 3. No training data needed, no classifier to maintain, and it's self-correcting — if a query is too hard for a tier, the quality check catches it.

The tradeoff is latency. Queries that escalate pay double or triple the time: a query that falls through to Tier 3 takes ~3 seconds (200ms + 800ms + 2000ms) instead of the ~2 seconds it would take with perfect routing. But the average latency stays low because most queries resolve on the first try. And the cost savings are the best of any strategy, because you never overshoot: you always land on the minimum tier that passes the quality check.
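You can sanity-check the average with the tier latencies from the table (~200ms/~800ms/~2s) and the 70/20/10 resolution split above:

```python
TIER_LATENCY_MS = {1: 200, 2: 800, 3: 2000}
RESOLVE_SHARE = {1: 0.70, 2: 0.20, 3: 0.10}   # share of queries resolving at each tier

# A query that resolves at tier t paid the latency of every attempt up to t
cumulative = {t: sum(TIER_LATENCY_MS[i] for i in range(1, t + 1))
              for t in TIER_LATENCY_MS}        # {1: 200, 2: 1000, 3: 3000}
avg_ms = sum(RESOLVE_SHARE[t] * cumulative[t] for t in RESOLVE_SHARE)
print(f"avg cascade latency: {avg_ms:.0f}ms")  # avg cascade latency: 640ms
```

Well under a second on average, even though the worst-case escalation path is 3 seconds.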

Benchmarking: Measuring What Matters

We've built four routers. Which one should you actually use? Let's define the metrics and run the numbers.

from dataclasses import dataclass

@dataclass
class RouterMetrics:
    name: str
    correct: int = 0
    total: int = 0
    total_cost: float = 0.0
    frontier_cost: float = 0.0
    quality_matches: int = 0

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0

    @property
    def cost_savings(self) -> float:
        return 1 - (self.total_cost / self.frontier_cost) if self.frontier_cost else 0

    @property
    def quality_preservation(self) -> float:
        return self.quality_matches / self.total if self.total else 0

def evaluate_router(name, router_fn, test_set, oracle_tiers):
    """Benchmark a router against the oracle (cheapest correct tier)."""
    metrics = RouterMetrics(name=name)

    for query, oracle_tier in zip(test_set, oracle_tiers):
        predicted_tier = router_fn(query)
        metrics.total += 1
        metrics.correct += int(predicted_tier == oracle_tier)
        metrics.total_cost += TIER_COSTS[predicted_tier]   # flat per-query proxy
        metrics.frontier_cost += TIER_COSTS[3]             # "always Tier 3" baseline
        # Quality preserved if predicted tier >= oracle tier
        metrics.quality_matches += int(predicted_tier >= oracle_tier)

    return metrics

# Normalize interfaces: each router must return an int tier.
# (`test_queries` / `oracle_labels` below are your labeled evaluation set.)
routers = [
    ("Heuristic",  heuristic_route),
    ("Embedding",  embedding_route),
    ("LLM-Judge",  lambda q: judge_route(q)["tier"]),
]
for name, fn in routers:
    m = evaluate_router(name, fn, test_queries, oracle_labels)
    print(f"{m.name:<20} Acc={m.accuracy:.0%}  Save={m.cost_savings:.0%}  "
          f"Qual={m.quality_preservation:.0%}")

# Cascade is evaluated via CascadeResult.tier_used — see cascade_route above

Here's what the numbers look like on a 500-query test set:

Strategy      Routing Accuracy  Cost Savings  Quality Preserved  Avg Latency Overhead
Heuristic     68%               45%           94%                +0ms
Embedding     83%               62%           96%                +5ms
LLM-as-Judge  91%               58%           98%                +120ms
Cascade       88%               71%           97%                +40ms avg

A few things jump out. Cascade has the best cost savings (71%) because it never overshoots — it always uses the exact minimum tier. LLM-as-Judge has the best accuracy (91%) but its cost savings are lower because the judge itself costs money and occasionally overestimates. Embedding-based routing is the best all-rounder: high accuracy, good savings, and negligible latency. Heuristics are the floor — better than nothing, but you're leaving a lot of money on the table.


Production Patterns: Making Routing Reliable

A router that works in a notebook is one thing. A router that handles 100K queries/month in production needs fallback chains, circuit breakers, and observability. Here's the production-grade version:

import time
import logging
from collections import defaultdict

logger = logging.getLogger("llm_router")

class ProductionRouter:
    """LLM router with provider fallbacks, circuit breaker, and monitoring."""

    def __init__(self, route_fn, providers: dict):
        self.route_fn = route_fn          # any of our 4 strategies
        self.providers = providers         # {tier: [provider1, provider2, ...]}
        self.error_counts = defaultdict(int)
        self.request_counts = defaultdict(int)
        self.circuit_open = set()         # providers currently in circuit-open state
        self.stats = {"tier_dist": defaultdict(int), "fallbacks": 0, "total": 0}

    def call_with_fallback(self, query: str, tier: int) -> str:
        """Try each provider for a tier; fall back to next tier on total failure."""
        for provider in self.providers.get(tier, []):
            if provider in self.circuit_open:
                continue
            try:
                start = time.time()
                result = provider.complete(query)
                latency = (time.time() - start) * 1000
                self._record_success(provider, tier, latency)
                return result
            except Exception as e:
                self._record_failure(provider, tier, e)

        # All providers for this tier failed — escalate
        if tier < 3:
            self.stats["fallbacks"] += 1
            logger.warning(f"Tier {tier} exhausted, escalating to Tier {tier+1}")
            return self.call_with_fallback(query, tier + 1)
        raise RuntimeError("All providers failed across all tiers")

    def route(self, query: str) -> str:
        tier = self.route_fn(query)
        self.stats["total"] += 1
        self.stats["tier_dist"][tier] += 1
        return self.call_with_fallback(query, tier)

    def _record_failure(self, provider, tier, error):
        self.error_counts[provider] += 1
        self.request_counts[provider] += 1
        # Open circuit after 5 consecutive failures
        if self.error_counts[provider] >= 5:
            self.circuit_open.add(provider)
            logger.error(f"Circuit OPEN for {provider} after 5 failures")

    def _record_success(self, provider, tier, latency_ms):
        self.error_counts[provider] = 0   # reset consecutive failures
        self.request_counts[provider] += 1
        logger.info(f"tier={tier} provider={provider} latency={latency_ms:.0f}ms")

The key production patterns here:

  1. Fallback chains: each tier has an ordered list of providers, and a failed call falls through to the next provider before escalating a tier.
  2. Circuit breaker: after 5 consecutive failures a provider is pulled out of rotation, so a dying provider stops adding latency to every request.
  3. Tier escalation on exhaustion: when every provider in a tier fails, the request moves up a tier instead of erroring out.
  4. Observability: tier distribution, fallback counts, and per-request latency are all recorded, so you can spot routing drift before it shows up on your bill.

One pattern that's easy to miss: A/B testing your router. Shadow-route 5% of traffic by running the query through both the routed tier and Tier 3, then comparing outputs with an LLM judge. This continuously validates that your routing decisions aren't degrading quality as query patterns shift.
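A minimal sketch of that shadow check; `route_fn`, `call_llm_fn`, and `judge_same_quality` are hypothetical hooks standing in for your router, model call, and judge:

```python
import random

def route_with_shadow(query, route_fn, call_llm_fn, judge_same_quality, log,
                      shadow_rate=0.05):
    """Answer via the routed tier; occasionally double-check against Tier 3."""
    tier = route_fn(query)
    answer = call_llm_fn(query, tier=tier)
    if tier < 3 and random.random() < shadow_rate:
        reference = call_llm_fn(query, tier=3)   # frontier answer, off the hot path
        log({"tier": tier,
             "quality_ok": judge_same_quality(query, answer, reference)})
    return answer
```

The user always gets the routed answer; the Tier 3 comparison happens only for the sampled 5% and only feeds your quality dashboard.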

Decision Framework: Which Router Should You Build?

Don't overthink this. The right choice mostly comes down to your monthly query volume and how much labeled data you already have.

Here's what the monthly costs look like across different traffic profiles:

Monthly Queries   Always Tier 1  Always Tier 3  Heuristic  Embedding  Cascade
1K (prototype)    $0.50          $30            $17        $12        $9
10K (mid SaaS)    $5             $300           $165       $114       $87
100K (consumer)   $50            $3,000         $1,650     $1,140     $870
1M (enterprise)   $500           $30,000        $16,500    $11,400    $8,700

Notice that "Always Tier 1" is the cheapest in raw dollars, but the quality loss means more escalations to human support, more user complaints, and more churn. The effective cost of routing everything to Tier 1 is much higher than the API bill suggests. Cascade routing gives you the best of both worlds: near-Tier-1 costs with near-Tier-3 quality.


Conclusion

The best model for the job depends on what the job actually is. And for most production workloads, the majority of jobs don't need your most expensive model.

Here's the takeaway ladder:

  1. Start with cascade routing. Zero training data, self-correcting, best cost savings. It's the right default for most teams.
  2. Add an embedding classifier when you have 500+ labeled examples. It reduces cascade latency overhead by getting the tier right on the first try ~83% of the time.
  3. Use LLM-as-judge for validation, not primary routing. Shadow-route 5% of traffic and compare outputs to continuously audit your router's decisions.
  4. Wrap it all in production patterns — provider fallbacks, circuit breakers, structured logging. Routing is infrastructure; treat it like infrastructure.

The $10,000 monthly bill from our opening? With cascade routing alone, it drops to under $3,000. With an embedding router feeding the cascade, it's closer to $2,500. Same quality, same user experience, 75% less spend. That's the kind of optimization that pays for the engineering time on day one.

If you're building on top of evaluation frameworks to measure quality (see our evaluating LLM systems post) and using structured output for cheap quality gates, routing becomes almost trivially easy. Cache before you route (caching post), batch when you can (batch processing post), and your LLM costs stop being a scaling problem.
