
Load Testing AI APIs: Finding the Breaking Point

Why AI APIs Break Differently

Your API handles 10 requests per second beautifully. At 50, latency doubles. At 100, everything falls apart. Somewhere between 10 and 100, something broke — but your standard load testing tools couldn't tell you what, when, or why.

If you're pointing wrk or Apache Bench at your LLM endpoint, you're getting a number. But it's the wrong number. LLM APIs aren't REST endpoints that return a JSON blob in one shot. They're streaming systems where a single response can take seconds, each token arrives individually, and the server's capacity depends on how much GPU memory is left for the KV cache — a constraint that no HTTP benchmarking tool knows about.

The root cause is that LLM inference has two fundamentally different phases. Prefill processes your entire prompt in parallel — it's compute-bound and fast. Decode generates tokens one at a time — it's memory-bandwidth-bound and sequential. Traditional APIs have roughly constant latency until saturation. LLM APIs are deeply non-linear: at concurrency 250, throughput can be 50x higher than at concurrency 1, while latency only increases 5x thanks to continuous batching amortizing the cost.
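To make the two phases concrete, here is a toy latency model. The throughput numbers are illustrative placeholders, not benchmarks; the point is only the shape of the arithmetic.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tps=5000.0, decode_tpot=0.025):
    """Toy two-phase model: prefill runs in parallel (tokens/sec),
    decode runs sequentially (sec/token). Numbers are illustrative."""
    ttft = prompt_tokens / prefill_tps          # compute-bound prefill
    ttlt = ttft + output_tokens * decode_tpot   # memory-bound decode
    return ttft, ttlt

# A 2,000-token prompt generating 200 tokens:
# TTFT = 0.4 s, total = 0.4 + 200 * 0.025 = 5.4 s. Decode dominates.
```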

The real concurrency ceiling isn't your API gateway or your load balancer. It's GPU memory via the KV cache. Every concurrent request needs its own key-value cache that grows with context length. The formula is straightforward:

# KV cache memory per token per sequence
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param

# Llama-3 70B (FP16): 2 * 80 * 8 * 128 * 2 = 327,680 bytes per token
# With 8K context: 327,680 * 8192 ≈ 2.68 GB per request
# 32 concurrent requests ≈ 86 GB just for KV cache
# That EXCEEDS an 80 GB A100 — before counting model weights
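The same arithmetic, wrapped in a helper you can point at your own model shape. The defaults are the Llama-3 70B values from the comment above; `gpu_gb` and `weights_gb` are whatever your deployment actually has.

```python
def max_kv_concurrency(gpu_gb, weights_gb, context_len,
                       num_layers=80, num_kv_heads=8,
                       head_dim=128, bytes_per_param=2):
    """Rough ceiling on concurrent full-context requests given the GPU
    memory left after model weights. A sketch of the math above."""
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param
    kv_per_request = kv_per_token * context_len
    free_bytes = (gpu_gb - weights_gb) * 1e9
    return max(0, int(free_bytes // kv_per_request))

# 70B FP16 weights (~140 GB) need multiple GPUs; with 2 x 80 GB total:
# max_kv_concurrency(160, 140, 8192) -> 7 full 8K-context requests
```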

Think of it like a kitchen. A traditional API is a checkout counter — each customer takes roughly the same time. An LLM API is a kitchen: each order takes a different amount of time, the chef works one step per dish in rotation (continuous batching), and the kitchen physically runs out of counter space as orders pile up. KV cache is the counter space.

This post is about finding the two curves that define every AI API's behavior: the throughput S-curve and the latency hockey stick. Once you can plot these curves, you can predict exactly where your system will break — and how close you are to the edge right now.

Building an Async Load Tester from Scratch

Why build your own load tester instead of using existing tools? Because existing tools have inconsistent metric definitions. LLMPerf includes TTFT in its inter-token latency calculation. GenAI-Perf doesn't. Neither handles your exact API contract, authentication scheme, or streaming format. A custom tester in ~50 lines of Python gives you full control over what you're measuring.

The core design: asyncio for concurrency, aiohttp for HTTP, and a semaphore to cap how many requests are in-flight simultaneously. Each request records four timestamps: when it was sent, when the first token arrived, each subsequent token, and the last token. From these, we derive everything.

import asyncio
import aiohttp
import time
from dataclasses import dataclass, field

@dataclass
class RequestMetrics:
    ttft: float = 0.0          # time to first token
    ttlt: float = 0.0          # time to last token
    token_count: int = 0
    tpot: float = 0.0          # time per output token
    token_times: list = field(default_factory=list)
    error: str = ""

async def stream_request(session, url, payload, semaphore):
    """Send one request, parse SSE stream, collect timing."""
    async with semaphore:
        m = RequestMetrics()
        start = time.perf_counter()
        try:
            async with session.post(url, json=payload) as resp:
                async for raw_line in resp.content:
                    line = raw_line.decode().strip()
                    if not line.startswith("data: "):
                        continue
                    if line == "data: [DONE]":  # skip OpenAI-style end sentinel
                        continue
                    now = time.perf_counter()
                    if m.token_count == 0:
                        m.ttft = now - start
                    m.token_times.append(now)
                    m.token_count += 1
                m.ttlt = time.perf_counter() - start
                if m.token_count > 1:
                    m.tpot = (m.ttlt - m.ttft) / (m.token_count - 1)
        except Exception as e:
            m.error = str(e)
        return m

def percentile(sorted_data, p):
    """Compute the p-th percentile from pre-sorted data."""
    if not sorted_data:
        return 0.0
    k = (len(sorted_data) - 1) * p / 100
    f = int(k)
    c = min(f + 1, len(sorted_data) - 1)
    return sorted_data[f] + (k - f) * (sorted_data[c] - sorted_data[f])

async def run_load_test(url, payload, concurrency, num_requests):
    """Fire num_requests at url with given concurrency cap."""
    sem = asyncio.Semaphore(concurrency)
    test_start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        tasks = [
            stream_request(session, url, payload, sem)
            for _ in range(num_requests)
        ]
        results = await asyncio.gather(*tasks)
    # true elapsed time of the whole test, including time requests spent
    # queued on the semaphore; max(r.ttlt) only covers the longest request
    wall = time.perf_counter() - test_start

    ok = [r for r in results if not r.error]
    ttfts = sorted(r.ttft for r in ok)
    total_tokens = sum(r.token_count for r in ok)

    print(f"Concurrency: {concurrency}")
    print(f"  Successful: {len(ok)}/{num_requests}")
    print(f"  Throughput: {total_tokens / wall:.0f} tokens/sec")
    print(f"  TTFT  p50={percentile(ttfts, 50)*1000:.0f}ms"
          f"  p95={percentile(ttfts, 95)*1000:.0f}ms"
          f"  p99={percentile(ttfts, 99)*1000:.0f}ms")
    return results

A few things worth noting. The asyncio.Semaphore is the fundamental concurrency knob — it limits how many requests are in-flight at once, which is exactly what we'll sweep in the next section. We parse SSE lines (data: ...) rather than buffering the full response, because we need per-token timing. And we compute percentiles rather than averages, because latency distributions for LLMs are heavily right-skewed — the p95 can be 5-10x the p50.

The Concurrency Curve — Finding the Sweet Spot

The concurrency curve is your central diagnostic. Run the load tester at exponentially increasing concurrency levels, plot the results, and two shapes emerge every single time: throughput traces an S-curve (near-linear growth, then a plateau, then decline past saturation), and latency traces a hockey stick (flat, then bending sharply upward).

The breaking point is where the hockey stick bends. That single number tells you more about your system's capacity than any other metric.

async def concurrency_sweep(url, payload, levels=None, requests_per=100):
    """Sweep concurrency levels and collect the two curves."""
    if levels is None:
        levels = [1, 2, 4, 8, 16, 32, 64, 128]

    rows = []
    for conc in levels:
        t0 = time.perf_counter()
        results = await run_load_test(url, payload, conc, requests_per)
        wall = time.perf_counter() - t0  # true elapsed time at this level
        ok = [r for r in results if not r.error]
        ttfts = sorted(r.ttft for r in ok)
        total_tok = sum(r.token_count for r in ok)
        err_rate = (len(results) - len(ok)) / len(results) * 100

        rows.append({
            "concurrency": conc,
            "throughput": total_tok / wall,
            "ttft_p50": percentile(ttfts, 50) * 1000,
            "ttft_p95": percentile(ttfts, 95) * 1000,
            "ttft_p99": percentile(ttfts, 99) * 1000,
            "error_pct": err_rate,
        })

    # Print results table
    print(f"{'Conc':>6} {'TPS':>8} {'p50ms':>8} {'p95ms':>8} "
          f"{'p99ms':>8} {'Err%':>6}")
    print("-" * 50)
    for r in rows:
        print(f"{r['concurrency']:>6} {r['throughput']:>8.0f} "
              f"{r['ttft_p50']:>8.0f} {r['ttft_p95']:>8.0f} "
              f"{r['ttft_p99']:>8.0f} {r['error_pct']:>5.1f}%")
    return rows

Here's what the output looks like for a Llama-3 8B model on a single A100 80GB GPU, tested with 4K-context prompts:

Concurrency   Throughput (tok/s)   TTFT p50   TTFT p95   TTFT p99   Error rate
          1                   58      45 ms      52 ms      61 ms           0%
          2                  112      47 ms      56 ms      68 ms           0%
          4                  218      51 ms      64 ms      82 ms           0%
          8                  410      58 ms      79 ms     105 ms           0%
         16                  735      72 ms     115 ms     162 ms           0%
         32                1,180      98 ms     195 ms     340 ms           0%
         64                1,450     145 ms     420 ms     980 ms         0.5%
        128                1,310     290 ms   1,350 ms   3,800 ms         4.2%

Read the table vertically. Throughput climbs from 58 to 1,450 tokens/sec — a 25x increase — then drops at 128 concurrent. Meanwhile TTFT p95 barely moves until concurrency 32, then explodes from 195 ms to 1,350 ms. The breaking point is right around 64 concurrent requests. Beyond that, you're past the knee of the hockey stick.
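Reading the knee out of the sweep can be automated. Here is a heuristic sketch (the thresholds are judgment calls you should tune for your own curves): keep advancing while the next level still buys meaningful throughput without blowing out tail latency.

```python
def find_breaking_point(rows, min_tput_gain=0.10, max_p95_jump=2.5):
    """Walk the sweep rows (as produced by concurrency_sweep) and return
    the last concurrency level that still looked healthy. Heuristic."""
    knee = rows[0]["concurrency"]
    for prev, cur in zip(rows, rows[1:]):
        grew   = cur["throughput"] >= prev["throughput"] * (1 + min_tput_gain)
        stable = cur["ttft_p95"]  <= prev["ttft_p95"] * max_p95_jump
        if grew and stable:
            knee = cur["concurrency"]
        else:
            break
    return knee
```

On the table above, this picks 64: the step to 128 loses throughput outright.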

Little's Law connects these curves into a capacity planning formula: concurrency = throughput × latency. If your API sustains 10 requests/sec with 2-second average latency, you have 20 requests in flight at any moment. Flip it around for planning: required_concurrency = target_throughput × measured_latency. The sweet spot is 70-80% of your measured saturation concurrency — enough to keep the GPU busy, with headroom for traffic spikes.
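Little's Law makes a two-line capacity planner. The 0.75 factor below encodes the 70-80% headroom rule; the inputs come from your own measurements.

```python
def plan_capacity(target_rps, mean_latency_s, saturation_concurrency):
    """Required in-flight requests vs. the safe operating point."""
    required   = target_rps * mean_latency_s    # Little's Law: L = lambda * W
    sweet_spot = 0.75 * saturation_concurrency  # stay below the knee
    return required, sweet_spot, required <= sweet_spot

# 10 req/s at 2 s latency -> 20 in flight; knee at 64 -> operate near 48
```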

Try It: Concurrency Curve Explorer

Adjust model size, context length, and GPU memory to see how the throughput S-curve and latency hockey stick shift. Watch the GPU memory bar fill up as concurrent requests compete for KV cache space.

Load Patterns That Reveal Real Problems

A single concurrency sweep tells you where the system breaks. To understand how it breaks in production, you need different load patterns. Each one is designed to surface a specific category of failure:

  1. Ramp-up (1 → N over T seconds): Smoothly increases load. Maps the full concurrency curve in one run. Finds the inflection point.
  2. Sustained (constant N for an extended period): Holds steady load. Reveals memory leaks, KV cache fragmentation, connection pool exhaustion — anything that degrades over time.
  3. Burst (periodic 10x spikes): Slams the system with sudden demand. Tests whether your autoscaler reacts fast enough, whether the queue drains cleanly, whether the system crashes or degrades gracefully.
  4. Soak test (moderate load for hours): The slow-burn test. Catches GPU memory not being freed after requests complete, especially common with self-hosted vLLM or TGI deployments where the memory allocator fragments over time.

Ramp finds the ceiling. Sustained reveals leaks. Bursts test queuing. Soak catches fragmentation.

import time

def ramp_pattern(peak_concurrency, duration_sec, steps=20):
    """Linearly ramp from 1 to peak over duration."""
    step_time = duration_sec / steps
    for i in range(steps + 1):
        target = max(1, int(peak_concurrency * i / steps))
        yield time.time() + i * step_time, target

def sustained_pattern(concurrency, duration_sec, interval=1.0):
    """Hold constant concurrency for duration."""
    steps = int(duration_sec / interval)
    for i in range(steps):
        yield time.time() + i * interval, concurrency

def burst_pattern(base, multiplier, duration_sec, burst_every=10):
    """Base load with periodic bursts of multiplier * base."""
    t = 0
    while t < duration_sec:
        in_burst = (t % burst_every) >= (burst_every * 0.7)
        target = base * multiplier if in_burst else base
        yield time.time() + t, target
        t += 0.5

def soak_pattern(concurrency, duration_sec, interval=1.0):
    """Constant load — pair with latency monitoring for drift."""
    steps = int(duration_sec / interval)
    for i in range(steps):
        yield time.time() + i * interval, concurrency

def build_test_plan(phases):
    """Compose patterns into a full test plan.

    phases = [
        ("ramp", {"peak_concurrency": 64, "duration_sec": 30}),
        ("sustained", {"concurrency": 48, "duration_sec": 120}),
        ("burst", {"base": 32, "multiplier": 5, "duration_sec": 60}),
    ]
    """
    plan = []
    offset = 0.0
    for name, kwargs in phases:
        gen = {"ramp": ramp_pattern, "sustained": sustained_pattern,
               "burst": burst_pattern, "soak": soak_pattern}[name]
        # shift each phase to start after the previous one ends, instead
        # of every phase's timestamps overlapping at time.time()
        plan.extend((ts + offset, target) for ts, target in gen(**kwargs))
        offset += kwargs["duration_sec"]
    return plan

Each generator yields (timestamp, target_concurrency) pairs that you feed into the load tester's semaphore. The build_test_plan function chains patterns into a multi-phase test: ramp to find the ceiling, sustain at 75% to check stability, then burst to test recovery. Compose them like building blocks.
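A minimal driver for these plans might look like the sketch below. This is an assumption about the wiring, not part of the tester above: workers watch a shared target and only issue requests while their slot number is below it, so concurrency follows the schedule without cancelling requests mid-stream.

```python
import asyncio
import time

async def drive_plan(plan, send_request):
    """Run a list of (timestamp, target_concurrency) pairs. `send_request`
    is any coroutine that performs one full streamed request."""
    state = {"target": 0, "done": False}

    async def worker(slot):
        while not state["done"]:
            if slot < state["target"]:      # this slot is currently "on"
                await send_request()
            else:                           # idle until the target rises
                await asyncio.sleep(0.05)

    peak = max(target for _, target in plan)
    tasks = [asyncio.create_task(worker(i)) for i in range(peak)]
    for ts, target in plan:
        await asyncio.sleep(max(0.0, ts - time.time()))
        state["target"] = target
    state["done"] = True                    # plan exhausted: wind down
    await asyncio.gather(*tasks)
```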

Try It: Load Pattern Simulator

Pick a load pattern and watch how concurrency, queue depth, and latency evolve over time. The server has a fixed capacity — see what happens when demand exceeds it.

AI-Specific Metrics That Matter

Standard HTTP metrics — response time, status codes, bytes transferred — are necessary but insufficient for AI APIs. A 200 OK with a 3-second TTFT and jittery token delivery is a degraded experience even though every request technically "succeeded." Here are the six metrics you actually need:

  1. TTFT (Time to First Token): How quickly the user sees something. Target <200 ms for chat. This is what makes an API feel responsive even when total generation takes seconds.
  2. TPOT (Time Per Output Token): The streaming speed. Target <30 ms for smooth reading pace (~33 tokens/sec). Higher means visible stuttering.
  3. ITL variance (inter-token latency jitter): Inconsistent speed feels worse than consistently slow. Compute std(itl) / mean(itl) — a coefficient of variation above 0.5 indicates problematic jitter.
  4. System TPS (total tokens/second across all requests): Your raw throughput capacity. This is what scales with hardware and batching.
  5. Cost per request: total_tokens × price_per_token. Track this in your load tests — you want to know if a configuration change that improves latency also doubles your spend.
  6. p99/p50 ratio: The tail latency multiplier. Under 3x is healthy. 3-10x is normal under load. Above 10x means heavy queuing — you're past the breaking point.

Metric                      Good       Acceptable      Problematic
TTFT (chat)                 <200 ms    200 ms – 1 s    >2 s
TPOT                        <30 ms     30–80 ms        >100 ms
System TPS (7B, 1×A100)     >2,000     1,000–2,000     <500
System TPS (70B, 1×A100)    >200       100–200         <50
p99/p50 ratio               <3×        3–10×           >10×
Error rate                  <0.1%      0.1–1%          >1%

The code below extends our load tester to compute all six metrics and flag anything outside the "good" range:

def compute_ai_metrics(results, cost_per_token=0.0):
    """Compute all six AI-specific metrics from load test results."""
    ok = [r for r in results if not r.error]
    if not ok:
        return {"error": "No successful requests"}

    ttfts = sorted(r.ttft for r in ok)
    tpots = sorted(r.tpot for r in ok if r.tpot > 0)
    total_tokens = sum(r.token_count for r in ok)
    # approximation: max(ttlt) undercounts true wall time when requests
    # queue; for exact system TPS, time the whole run and use that instead
    wall = max(r.ttlt for r in ok)

    # Inter-token latency jitter
    all_itls = []
    for r in ok:
        for i in range(1, len(r.token_times)):
            all_itls.append(r.token_times[i] - r.token_times[i - 1])
    itl_cv = 0.0
    if all_itls:
        mean_itl = sum(all_itls) / len(all_itls)
        var_itl = sum((x - mean_itl) ** 2 for x in all_itls) / len(all_itls)
        itl_cv = (var_itl ** 0.5) / mean_itl if mean_itl > 0 else 0

    metrics = {
        "ttft_p50":   percentile(ttfts, 50) * 1000,
        "ttft_p95":   percentile(ttfts, 95) * 1000,
        "tpot_p50":   percentile(tpots, 50) * 1000 if tpots else 0,
        "system_tps": total_tokens / wall,
        "itl_cv":     itl_cv,
        "p99_p50":    percentile(ttfts, 99) / max(percentile(ttfts, 50), 1e-9),
        "error_pct":  (len(results) - len(ok)) / len(results) * 100,
        "cost":       total_tokens * cost_per_token,
    }

    # Threshold checks
    checks = [
        ("TTFT p95",    metrics["ttft_p95"],   200, 1000,  "ms"),
        ("TPOT p50",    metrics["tpot_p50"],   30,  80,    "ms"),
        ("System TPS",  metrics["system_tps"], 2000, 1000, "tok/s"),  # inverted
        ("ITL jitter",  metrics["itl_cv"],     0.3, 0.5,   "CV"),
        ("p99/p50",     metrics["p99_p50"],    3,   10,    "x"),
        ("Error rate",  metrics["error_pct"],  0.1, 1.0,   "%"),
    ]

    for name, val, good_thresh, warn_thresh, unit in checks:
        if name == "System TPS":
            status = "GOOD" if val > good_thresh else (
                     "WARN" if val > warn_thresh else "FAIL")
        else:
            status = "GOOD" if val < good_thresh else (
                     "WARN" if val < warn_thresh else "FAIL")
        marker = {"GOOD": "+", "WARN": "~", "FAIL": "!"}[status]
        print(f"  [{marker}] {name:<14} {val:>8.1f} {unit:<6} [{status}]")

    return metrics

One gotcha that trips up nearly everyone: Nginx and most reverse proxies buffer responses by default (proxy_buffer_size defaults to one memory page, typically 4-8 KB, and proxy_buffers adds more). That means your TTFT measurement includes the time to fill a buffer, not the time to the actual first token. Fix this with the X-Accel-Buffering: no response header, or set proxy_buffering off in your Nginx config. Without this, your TTFT numbers are meaningless.
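For reference, a minimal Nginx location block for a streaming LLM route might look like this sketch (`llm_backend` is a placeholder upstream name; adapt paths and timeouts to your setup):

```nginx
location /v1/ {
    proxy_pass http://llm_backend;
    proxy_buffering off;       # stream each SSE chunk through immediately
    proxy_cache off;
    proxy_http_version 1.1;
    proxy_read_timeout 300s;   # long generations outlive the 60 s default
}
```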

Detecting Degradation Before Your Users Do

LLM inference latency is inherently noisy. Variable prompt lengths, GPU thermal throttling, non-deterministic scheduling in the serving framework — you'll see 10-20% variance between runs even with identical configurations. The challenge: distinguish a real regression from random noise.

The solution is Welch's t-test, which is preferred over Student's t-test because it handles unequal variances between the two samples (baseline vs. candidate). The workflow:

  1. Run baseline: 100+ requests at a fixed concurrency level.
  2. Deploy your change (new model version, config tweak, infrastructure update).
  3. Run the identical test: same prompts, same concurrency, same hardware.
  4. Apply Welch's t-test to the TTFT distributions.
  5. If p-value < 0.05 AND the effect size exceeds your threshold — flag it.

The "AND" matters. A statistically significant 1 ms difference isn't a real problem. We use Cohen's d (standardized effect size) and a percentage threshold to avoid false alarms.

import math

def welch_t_test(a, b):
    """Welch's t-test for unequal variances. Returns t-stat and p-value."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)

    se = math.sqrt(var_a / n_a + var_b / n_b)
    t_stat = (mean_a - mean_b) / se if se > 0 else 0

    # Welch-Satterthwaite degrees of freedom
    num = (var_a / n_a + var_b / n_b) ** 2
    den = (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    df = num / den if den > 0 else 1

    # Approximate two-tailed p-value via the normal distribution for
    # large df: p = 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))

    return t_stat, p_value, df

def detect_regression(baseline_ttfts, candidate_ttfts, threshold_pct=10):
    """Compare two sets of TTFT values. Return pass/fail verdict."""
    mean_base = sum(baseline_ttfts) / len(baseline_ttfts)
    mean_cand = sum(candidate_ttfts) / len(candidate_ttfts)
    pct_change = (mean_cand - mean_base) / mean_base * 100

    t_stat, p_value, df = welch_t_test(baseline_ttfts, candidate_ttfts)

    # Cohen's d for effect size
    pooled_std = math.sqrt(
        (sum((x - mean_base)**2 for x in baseline_ttfts) +
         sum((x - mean_cand)**2 for x in candidate_ttfts))
        / (len(baseline_ttfts) + len(candidate_ttfts) - 2)
    )
    cohens_d = abs(mean_cand - mean_base) / pooled_std if pooled_std > 0 else 0

    is_regression = p_value < 0.05 and pct_change > threshold_pct

    print(f"Baseline mean: {mean_base*1000:.1f} ms")
    print(f"Candidate mean: {mean_cand*1000:.1f} ms")
    print(f"Change: {pct_change:+.1f}%")
    print(f"Welch's t={t_stat:.2f}, p={p_value:.4f}, df={df:.1f}")
    print(f"Cohen's d={cohens_d:.2f}")
    print(f"Verdict: {'REGRESSION DETECTED' if is_regression else 'PASS'}")
    return is_regression

Integrate this into CI/CD: run load tests on every deployment, compare against the last known-good baseline, auto-block deploys that regress p95 TTFT by more than 15%. One caveat: Welch's t-test assumes roughly normal distributions. LLM latency is right-skewed, so for strict correctness you can log-transform the values first, or use a non-parametric Mann-Whitney U test. In practice, with 100+ samples, the t-test is robust enough for regression detection.
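If you want the non-parametric route, a self-contained Mann-Whitney U with a normal approximation is only a few lines. This is a sketch; for production use, scipy.stats.mannwhitneyu handles ties and exact p-values properly.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.
    Rank-based, so heavy right skew in latencies doesn't violate it."""
    n1, n2 = len(a), len(b)
    combined = sorted((x, g) for g, xs in ((0, a), (1, b)) for x in xs)
    rank_sum_a, i = 0.0, 0
    while i < len(combined):            # assign average ranks across ties
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j + 1) / 2      # ranks are 1-based
        rank_sum_a += avg_rank * sum(1 for k in range(i, j)
                                     if combined[k][1] == 0)
        i = j
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mu) / sigma if sigma > 0 else 0.0
    return math.erfc(abs(z) / math.sqrt(2))          # two-tailed p-value
```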

Conclusion

Every AI API has a breaking point, and it's determined by two curves: the throughput S-curve and the latency hockey stick. The inflection happens when GPU memory fills up with KV cache — a constraint that depends on model size, context length, and how many requests you're serving simultaneously.

The toolkit is straightforward: an async load tester that handles streaming, a concurrency sweep that maps both curves, load patterns that stress-test different failure modes, and a statistical detector that catches regressions before they reach production. Build the sweep first. Find your breaking point. Then stay at 70-80% of it.

The numbers in this post are for a specific configuration (Llama-3 8B on A100 80GB). Your numbers will be different. That's the whole point — run the tests yourself, on your hardware, with your prompts, at your expected traffic patterns. The curves will have the same shapes. Only the inflection points will move.
