Serving LLMs at Scale: From Naive to vLLM
The Naive Serving Problem
Serving a ResNet classifier is a solved problem. A single forward pass takes about 5ms, the model weighs ~100MB, and you can handle thousands of requests per second on modest hardware. Serving an LLM is a completely different beast.
The core issue is that LLM inference is autoregressive: each token depends on all previous tokens. You can't compute the whole output in one shot. A single request might take 5–60 seconds, and memory grows dynamically as the KV cache expands with every generated token.
Let's make this concrete with Llama 2 7B. Model weights at FP16 occupy 7 billion parameters × 2 bytes = 14GB. Now for the KV cache — at a sequence length of 4096 tokens, each request needs:
batch × seq_len × 2 (K+V) × num_layers × head_dim × num_heads × sizeof(FP16)
= 1 × 4096 × 2 × 32 × 128 × 32 × 2 bytes ≈ 2GB per request
On a single 24GB GPU: 14GB for weights + 2GB for one request's KV cache = 16GB. You have room for maybe four concurrent requests before hitting the VRAM ceiling. If each request takes 10 seconds, that's 24 requests per minute. Production needs 600+.
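That KV cache arithmetic is worth sanity-checking in code. Here's a minimal calculator using the same formula — the shape constants (32 layers, 32 heads, head_dim 128) are the standard Llama 2 7B configuration:

```python
def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim, dtype_bytes=2, batch=1):
    """KV cache size = batch * seq_len * 2 (K and V) * layers * heads * head_dim * bytes."""
    return batch * seq_len * 2 * num_layers * num_heads * head_dim * dtype_bytes

# Llama 2 7B: 32 layers, 32 heads, head_dim 128, FP16 (2 bytes)
per_request = kv_cache_bytes(seq_len=4096, num_layers=32, num_heads=32, head_dim=128)
print(f"{per_request / 2**30:.1f} GiB per request")  # 2.0 GiB, matching the math above
```

Plug in your own model's config to see how fast the cache grows — sequence length and layer count are the dominant terms.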
That's a 25x gap — and it only gets worse with longer sequences or larger models. If you're not already familiar with KV cache mechanics, our KV Cache from Scratch post covers the memory math in detail. And if you want to shrink those weights, Quantization from Scratch shows you how to cut model size by 2–4x.
This post is about closing that throughput gap. We'll walk through four increasingly sophisticated serving strategies, benchmark them head-to-head, and build interactive visualizations so you can see exactly where the bottlenecks are.
Static Batching — The Obvious First Try
The simplest optimization is familiar from any ML serving context: batch your requests. Collect N incoming requests, pack them into a single batch, and process them together on the GPU. The matrix multiplications get amortized across the batch, and you squeeze more useful work out of each GPU cycle.
With a batch size of 8, you might see a 5x throughput improvement (not 8x — the decode phase is memory-bandwidth-bound, not compute-bound, so batching helps less than in pure compute workloads like image classification).
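A back-of-envelope calculation shows why decode is bandwidth-bound: every decode step must stream the full weight tensor through the GPU once, regardless of batch size, so the weight-read time sets a floor on step latency. The numbers below are illustrative assumptions (14GB of FP16 weights, ~2TB/s of A100-class HBM bandwidth), not measurements:

```python
weights_gb = 14        # Llama 2 7B at FP16 (assumed)
bandwidth_gb_s = 2000  # ~2 TB/s HBM bandwidth, A100-class (assumed)

# Time to read the weights once -- the floor on a single decode step
ms_per_step = weights_gb / bandwidth_gb_s * 1000
print(f"Step-time floor: {ms_per_step:.0f} ms")

# Batching amortizes the same weight read across more tokens
for batch in [1, 8, 64]:
    print(f"batch={batch:>2}: <= {batch / (ms_per_step / 1000):.0f} tokens/s upper bound")
```

The same weight read serves the whole batch, which is exactly why batching lifts throughput — and why the gain is sublinear once other costs (KV cache reads, which grow with batch size) start to matter.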
But static batching has a fatal flaw: padding waste. Imagine four requests arrive with output lengths [50, 100, 500, 2000] tokens. In a static batch, all four must run until the longest finishes. The first three requests are done generating but sit idle — and the GPU cycles spent on their padding positions are pure waste.
Let's measure exactly how bad this gets:
```python
import random
import statistics

def simulate_static_batching(num_requests, batch_size, min_tokens, max_tokens, seed=42):
    """Simulate a static batching LLM server and measure waste."""
    rng = random.Random(seed)
    output_lengths = [rng.randint(min_tokens, max_tokens) for _ in range(num_requests)]
    total_useful_steps = 0
    total_padded_steps = 0
    request_latencies = []
    for batch_start in range(0, num_requests, batch_size):
        batch = output_lengths[batch_start:batch_start + batch_size]
        max_len = max(batch)
        for length in batch:
            total_useful_steps += length
            total_padded_steps += max_len - length
            # Every request in the batch waits for the longest one
            request_latencies.append(max_len)
    waste_pct = total_padded_steps / (total_useful_steps + total_padded_steps) * 100
    throughput = total_useful_steps / sum(request_latencies) * batch_size
    print(f"Static Batching (batch_size={batch_size})")
    print(f" Requests: {num_requests}")
    print(f" Output lengths: {min_tokens}-{max_tokens} tokens")
    print(f" Useful tokens: {total_useful_steps:,}")
    print(f" Padded tokens: {total_padded_steps:,}")
    print(f" Padding waste: {waste_pct:.1f}%")
    print(f" Avg latency: {statistics.mean(request_latencies):.0f} steps/request")
    print(f" Relative throughput: {throughput:.2f} tokens/step")

# Wide variance in output length = lots of waste
simulate_static_batching(100, batch_size=8, min_tokens=20, max_tokens=500)
print()
# Narrow variance = less waste, but still blocked
simulate_static_batching(100, batch_size=8, min_tokens=200, max_tokens=300)
```
Run that and you'll see padding waste of 40–60% with variable output lengths. Even when the variance is small, those shorter requests in each batch are stuck waiting. Static batching achieves 40–60% GPU utilization at best.
The insight that breaks through this ceiling? Don't treat the batch as a monolith.
Continuous Batching — The Scheduling Revolution
The breakthrough came from a 2022 paper by Yu et al. called Orca, published at OSDI. The key idea is deceptively simple: instead of waiting for the entire batch to finish, schedule at the iteration level.
At every single decode step, the scheduler checks: has any request hit its EOS token or max length? If so, evict it from the batch and immediately slot in a waiting request from the queue. The GPU never idles while there's work to do.
Think of it like a restaurant with 8 tables (batch size 8). Static batching seats 8 guests, waits for all 8 to finish dessert, then clears and seats the next 8. Continuous batching seats a new guest the moment a table opens. Same kitchen, dramatically higher throughput.
The numbers are striking: on workloads with variable output lengths (20–500 tokens), continuous batching delivers 2–3x higher throughput than static batching. The improvement is directly proportional to the variance in output lengths — the more variable your traffic, the bigger the win.
```python
import random

def simulate_continuous_batching(num_requests, batch_size, min_tokens, max_tokens, seed=42):
    """Simulate iteration-level scheduling with a request queue."""
    rng = random.Random(seed)
    output_lengths = [rng.randint(min_tokens, max_tokens) for _ in range(num_requests)]
    queue = list(range(num_requests))  # Request IDs waiting
    active = {}  # {request_id: remaining_tokens}
    completed = 0
    total_steps = 0
    latencies = {}  # request_id -> latency in decode steps
    arrival_step = {i: 0 for i in range(num_requests)}  # All arrive at step 0
    # Fill initial batch
    for _ in range(min(batch_size, len(queue))):
        req_id = queue.pop(0)
        active[req_id] = output_lengths[req_id]
    while active or queue:
        total_steps += 1
        # One decode step: every active request generates one token
        finished_ids = []
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                finished_ids.append(req_id)
        # Remove finished, add from queue
        for req_id in finished_ids:
            del active[req_id]
            latencies[req_id] = total_steps - arrival_step[req_id]
            completed += 1
        # Fill empty slots from queue
        while len(active) < batch_size and queue:
            req_id = queue.pop(0)
            active[req_id] = output_lengths[req_id]
            arrival_step[req_id] = total_steps
    tokens_generated = sum(output_lengths)
    throughput = tokens_generated / total_steps if total_steps > 0 else 0
    avg_latency = sum(latencies.values()) / len(latencies) if latencies else 0
    print(f"Continuous Batching (batch_size={batch_size})")
    print(f" Requests: {num_requests}")
    print(f" Total decode steps: {total_steps}")
    print(f" Tokens generated: {tokens_generated:,}")
    print(f" Throughput: {throughput:.1f} tokens/step")
    print(f" Avg latency: {avg_latency:.0f} steps/request")
    print(f" Completed: {completed}/{num_requests}")
    return throughput

print("=== Variable output lengths (20-500 tokens) ===")
tp_continuous = simulate_continuous_batching(100, 8, 20, 500)
print()
print("=== Narrow output lengths (200-300 tokens) ===")
simulate_continuous_batching(100, 8, 200, 300)
```
Compare the throughput numbers: continuous batching keeps the GPU near full utilization because finished requests get replaced immediately. There's no padding — every decode step does useful work for every active request.
But there's a catch. Each request has its own KV cache that persists across iterations and grows with every generated token. When a request finishes, its memory can be freed, but when a new one starts, memory must be allocated. Managing this efficiently — without fragmentation, without waste — is the next challenge.
PagedAttention — Virtual Memory for AI
In 2023, Kwon et al. published a paper at SOSP that changed LLM serving forever. The idea was inspired by a concept every operating systems student learns: virtual memory.
Traditional KV cache allocation works like old-school contiguous memory: each request reserves a large block upfront based on the maximum possible sequence length. If your max is 4096 tokens but the request only generates 200, you've wasted 95% of that reservation. Across many concurrent requests, this waste is catastrophic — typical utilization is 20–40% of allocated VRAM.
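A quick sketch makes the waste tangible. With toy numbers — a 4096-token worst-case reservation against requests that actually generate 50–800 tokens (an assumed distribution, not a measurement):

```python
import random

rng = random.Random(0)
max_seq_len = 4096  # contiguous reservation per request, worst case
actual = [rng.randint(50, 800) for _ in range(32)]  # what requests really generate

reserved = len(actual) * max_seq_len
used = sum(actual)
# Every request reserves for the worst case, so utilization here is capped
# at 800/4096, i.e. under 20% no matter how the lengths fall
print(f"Reserved: {reserved:,} token-slots, used: {used:,} ({used / reserved:.1%})")
```

Everything between `used` and `reserved` is VRAM that no other request can touch, even though it holds nothing.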
PagedAttention treats KV cache memory like virtual pages. Memory is divided into fixed-size blocks (typically 16 tokens each). Each request gets a page table that maps logical token positions to physical memory blocks. Pages are allocated on demand as the sequence grows and freed immediately when the request completes.
The results speak for themselves:
| Metric | Contiguous Allocation | PagedAttention |
|---|---|---|
| Memory waste | 60–80% | < 4% |
| Concurrent requests (24GB GPU) | 3–4 | 12–20 |
| Fragmentation | Severe | Near-zero |
There's a bonus: prefix caching. If 100 requests share the same system prompt (say, 500 tokens), PagedAttention stores that prompt's KV cache once and shares the pages across all 100 requests via copy-on-write. That's 500 × 100 = 50,000 tokens of KV cache reduced to 500.
```python
import random

class PagedKVCacheManager:
    """Simplified PagedAttention memory manager with page tables."""

    def __init__(self, total_pages, tokens_per_page=16):
        self.total_pages = total_pages
        self.tokens_per_page = tokens_per_page
        self.free_pages = list(range(total_pages))  # Free page pool
        self.page_tables = {}  # request_id -> [page_indices]
        self.ref_counts = [0] * total_pages  # For shared prefix pages

    def allocate_request(self, request_id, num_tokens):
        """Allocate pages for a new request. Pages grow on demand."""
        pages_needed = (num_tokens + self.tokens_per_page - 1) // self.tokens_per_page
        if pages_needed > len(self.free_pages):
            return False  # Out of memory
        allocated = [self.free_pages.pop() for _ in range(pages_needed)]
        for p in allocated:
            self.ref_counts[p] = 1
        self.page_tables[request_id] = allocated
        return True

    def grow_request(self, request_id, new_tokens):
        """Allocate additional pages as sequence grows."""
        current_pages = len(self.page_tables[request_id])
        current_capacity = current_pages * self.tokens_per_page
        total_tokens = current_capacity + new_tokens
        extra_pages = (total_tokens + self.tokens_per_page - 1) // self.tokens_per_page - current_pages
        if extra_pages > len(self.free_pages):
            return False
        for _ in range(extra_pages):
            page = self.free_pages.pop()
            self.ref_counts[page] = 1
            self.page_tables[request_id].append(page)
        return True

    def share_prefix(self, source_id, target_id, prefix_pages):
        """Share prefix pages between requests (copy-on-write)."""
        shared = self.page_tables[source_id][:prefix_pages]
        for p in shared:
            self.ref_counts[p] += 1
        self.page_tables[target_id] = shared.copy()

    def free_request(self, request_id):
        """Free all pages for a completed request."""
        for page in self.page_tables.pop(request_id, []):
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:
                self.free_pages.append(page)

    def utilization(self):
        allocated = self.total_pages - len(self.free_pages)
        return allocated / self.total_pages * 100

# Demo: compare contiguous vs paged allocation
total_gpu_pages = 256  # Represent 256 pages of KV cache memory
manager = PagedKVCacheManager(total_gpu_pages, tokens_per_page=16)

# Simulate 10 requests with variable lengths
rng = random.Random(42)
request_lengths = [rng.randint(32, 512) for _ in range(10)]
print("PagedAttention Allocation:")
for i, length in enumerate(request_lengths):
    success = manager.allocate_request(f"req_{i}", length)
    pages_used = len(manager.page_tables.get(f"req_{i}", []))
    print(f" req_{i}: {length} tokens -> {pages_used} pages {'[OK]' if success else '[OOM]'}")
print(f"\n Memory utilization: {manager.utilization():.1f}%")
print(f" Free pages: {len(manager.free_pages)}/{total_gpu_pages}")

# Now show prefix sharing
print("\nPrefix Sharing Demo:")
manager2 = PagedKVCacheManager(128, tokens_per_page=16)
manager2.allocate_request("base", 256)  # 256-token system prompt = 16 pages
base_pages = len(manager2.page_tables["base"])
print(f" Base prompt: 256 tokens = {base_pages} pages")
for i in range(5):
    manager2.share_prefix("base", f"shared_{i}", base_pages)
    manager2.grow_request(f"shared_{i}", rng.randint(32, 128))
    total = len(manager2.page_tables[f"shared_{i}"])
    print(f" shared_{i}: {base_pages} shared + {total - base_pages} unique = {total} total pages")
print(f"\n With sharing: {manager2.utilization():.1f}% utilized")
print(f" Without sharing: would need {base_pages * 5} extra pages for duplicated prefixes")
```
This is what powers vLLM, the most popular open-source LLM serving engine. By combining continuous batching with PagedAttention, vLLM achieves 3–5x higher throughput than naive serving — and that's before any other optimizations.
Speculative Decoding in Production
If continuous batching and PagedAttention optimize throughput, speculative decoding attacks latency. We covered the algorithmic foundations in Speculative Decoding from Scratch — here's how it fits into a real serving stack.
The idea: use a small, fast draft model to propose K tokens at once, then verify them all in a single forward pass through the large target model. If each draft token matches the target 70% of the time and drafting stops at the first mismatch, K=5 yields about 2.9 tokens per target model forward pass instead of 1.
In a serving context, speculative decoding and continuous batching are complementary: speculative decoding reduces per-request latency (fewer forward passes needed), while continuous batching increases overall throughput (more concurrent requests). Modern serving engines like vLLM support both simultaneously.
Typical acceptance rates depend on how well-matched the draft model is:
- Generic small draft model (e.g., 68M params): 50–65% acceptance
- Distilled draft model (trained on target's outputs): 65–80% acceptance
- EAGLE-style (trained prediction head): 80%+ acceptance, up to 3x speedup
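Under the simple model the simulation below uses — each draft token accepted independently with probability p, the chain stopping at the first rejection, plus one corrected token from the target — the expected tokens per target pass is a geometric series with a closed form:

```python
def expected_tokens_per_pass(p, k):
    """E[accepted + 1] = 1 + p + p^2 + ... + p^k = (1 - p**(k + 1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

for p in [0.5, 0.65, 0.8]:
    print(f"accept={p:.0%}: {expected_tokens_per_pass(p, 5):.2f} tokens/target pass")
```

Note how the returns diminish: pushing acceptance from 50% to 65% helps more than raising K, because the chain rarely survives past the first few draft tokens anyway.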
```python
import random

def simulate_speculative_decoding(
    total_tokens, draft_k, acceptance_rate,
    draft_cost, target_cost, seed=42
):
    """
    Simulate speculative decoding vs standard autoregressive generation.
    Costs are in arbitrary time units per forward pass.
    """
    rng = random.Random(seed)
    # Standard autoregressive: one target forward pass per token
    standard_time = total_tokens * target_cost
    standard_passes = total_tokens
    # Speculative: draft K tokens, verify in one target pass
    spec_tokens = 0
    spec_time = 0.0
    spec_target_passes = 0
    spec_draft_passes = 0
    while spec_tokens < total_tokens:
        # Draft phase: K forward passes through small model
        spec_time += draft_k * draft_cost
        spec_draft_passes += draft_k
        # Verify phase: one forward pass through target model
        spec_time += target_cost
        spec_target_passes += 1
        # Count accepted tokens (each has acceptance_rate probability)
        accepted = 0
        for i in range(draft_k):
            if rng.random() < acceptance_rate:
                accepted += 1
            else:
                break  # First rejection stops the chain
        # Always get at least 1 token (from the target model's correction)
        tokens_this_round = accepted + 1
        spec_tokens += tokens_this_round
    speedup = standard_time / spec_time
    tokens_per_pass = spec_tokens / spec_target_passes
    print(f"Speculative Decoding (K={draft_k}, accept={acceptance_rate:.0%})")
    print(f" Draft cost: {draft_cost}, Target cost: {target_cost}")
    print(f" Tokens generated: {spec_tokens}")
    print(f" Target passes: {spec_target_passes} (vs {standard_passes} standard)")
    print(f" Tokens/target pass: {tokens_per_pass:.1f}")
    print(f" Speedup: {speedup:.2f}x")
    return speedup

print("=== Varying acceptance rates ===")
speedups = []
rates = [0.5, 0.6, 0.7, 0.8, 0.9]
for rate in rates:
    s = simulate_speculative_decoding(
        total_tokens=200, draft_k=5,
        acceptance_rate=rate,
        draft_cost=0.05, target_cost=1.0
    )
    speedups.append(s)
    print()
print("Summary:")
print(f" {'Rate':-<12} {'Speedup':-<10}")
for rate, s in zip(rates, speedups):
    bar = '#' * int(s * 10)
    print(f" {rate:<12.0%} {s:<10.2f} {bar}")
```
The key insight from those numbers: speculative decoding pays for itself even at modest acceptance rates. A 60% acceptance rate with K=5 still yields a ~1.9x speedup because the draft model is 20x cheaper than the target. You're trading cheap draft passes for expensive target passes.
Benchmarks — Framework Comparison
Theory is nice, but numbers are what matter in production. Let's put all four approaches head-to-head on the same simulated workload: 200 requests, variable output lengths (20–1000 tokens), batch size 16.
```python
import random
import statistics

def benchmark_serving_approaches(
    num_requests=200, batch_size=16,
    min_tokens=20, max_tokens=1000,
    max_memory_pages=512, tokens_per_page=16,
    seed=42
):
    """Benchmark four serving strategies on the same workload."""
    rng = random.Random(seed)
    output_lengths = [rng.randint(min_tokens, max_tokens) for _ in range(num_requests)]
    total_tokens = sum(output_lengths)
    results = {}

    # 1. Naive sequential
    naive_steps = sum(output_lengths)
    naive_throughput = total_tokens / naive_steps  # 1.0 tokens/step always
    results["Naive Sequential"] = {
        "throughput": naive_throughput,
        "total_steps": naive_steps,
        "avg_latency": statistics.mean(output_lengths),
        "gpu_util": "~15%",
    }

    # 2. Static batching
    static_steps = 0
    static_latencies = []
    for i in range(0, num_requests, batch_size):
        batch = output_lengths[i:i + batch_size]
        max_len = max(batch)
        static_steps += max_len
        static_latencies.extend([max_len] * len(batch))
    results["Static Batching"] = {
        "throughput": total_tokens / static_steps,
        "total_steps": static_steps,
        "avg_latency": statistics.mean(static_latencies),
        "gpu_util": "~55%",
    }

    # 3. Continuous batching
    queue = list(range(num_requests))
    active = {}
    cont_steps = 0
    cont_latencies = {}
    arrival = {i: 0 for i in range(num_requests)}
    for _ in range(min(batch_size, len(queue))):
        rid = queue.pop(0)
        active[rid] = output_lengths[rid]
    while active or queue:
        cont_steps += 1
        done = [r for r in active if active[r] <= 1]
        for r in done:
            del active[r]
            cont_latencies[r] = cont_steps - arrival[r]
        for r in list(active):
            active[r] -= 1
        while len(active) < batch_size and queue:
            rid = queue.pop(0)
            active[rid] = output_lengths[rid]
            arrival[rid] = cont_steps
    results["Continuous Batching"] = {
        "throughput": total_tokens / cont_steps,
        "total_steps": cont_steps,
        "avg_latency": statistics.mean(cont_latencies.values()),
        "gpu_util": "~78%",
    }

    # 4. PagedAttention + continuous batching (higher effective batch)
    # PagedAttention allows ~3x more concurrent requests due to memory savings
    effective_batch = min(batch_size * 3, num_requests)
    queue2 = list(range(num_requests))
    active2 = {}
    paged_steps = 0
    paged_latencies = {}
    arrival2 = {i: 0 for i in range(num_requests)}
    for _ in range(min(effective_batch, len(queue2))):
        rid = queue2.pop(0)
        active2[rid] = output_lengths[rid]
    while active2 or queue2:
        paged_steps += 1
        done = [r for r in active2 if active2[r] <= 1]
        for r in done:
            del active2[r]
            paged_latencies[r] = paged_steps - arrival2[r]
        for r in list(active2):
            active2[r] -= 1
        while len(active2) < effective_batch and queue2:
            rid = queue2.pop(0)
            active2[rid] = output_lengths[rid]
            arrival2[rid] = paged_steps
    results["PagedAttention (vLLM)"] = {
        "throughput": total_tokens / paged_steps,
        "total_steps": paged_steps,
        "avg_latency": statistics.mean(paged_latencies.values()),
        "gpu_util": "~94%",
    }

    # Print comparison table
    print(f"Benchmark: {num_requests} requests, {min_tokens}-{max_tokens} tokens, batch={batch_size}")
    print(f"Total tokens to generate: {total_tokens:,}\n")
    print(f"{'Approach':-<28} {'Tok/Step':>10} {'Steps':>10} {'Avg Lat':>10} {'GPU Util':>10}")
    print("-" * 70)
    for name, r in results.items():
        print(f"{name:-<28} {r['throughput']:>10.1f} {r['total_steps']:>10,} "
              f"{r['avg_latency']:>10.0f} {r['gpu_util']:>10}")

    # Show scaling: throughput at different concurrency levels
    print(f"\n{'Concurrency':-<15} {'Static':>10} {'Continuous':>12} {'Paged':>10}")
    print("-" * 50)
    for conc in [1, 4, 8, 16, 32, 64]:
        s_tp = min(conc, batch_size) * 0.6  # Static: limited by batch, 60% efficient
        c_tp = min(conc, batch_size) * 0.95  # Continuous: 95% efficient up to batch
        p_tp = min(conc, batch_size * 3) * 0.95  # Paged: 3x effective batch
        print(f"{conc:-<15} {s_tp:>10.1f} {c_tp:>12.1f} {p_tp:>10.1f}")

benchmark_serving_approaches()
```
The numbers tell a clear story. Naive sequential is the baseline: one token per step, always. Static batching helps with throughput but wastes compute on padding. Continuous batching eliminates padding waste. And PagedAttention unlocks the memory to run 3x more concurrent requests, pushing throughput to another level entirely.
Notice the scaling table at the bottom: static batching tops out at the configured batch size regardless of demand. Continuous batching uses each slot more efficiently, but its ceiling is the same — how many KV caches fit in memory. PagedAttention raises that ceiling dramatically.
Try It: Batch Scheduling Visualizer
Production Serving Checklist
Knowing the theory is one thing. Here's what actually matters when you're putting an LLM into production:
Model Selection
- Quantize aggressively. INT8 halves your memory, INT4 quarters it. A 7B model at INT4 fits in 3.5GB instead of 14GB. See our quantization post for the math.
- Right-size the model. A well-prompted 7B model often matches a carelessly prompted 70B model on specific tasks. Benchmark your actual workload before committing to a larger model.
Hardware Sizing
- VRAM budget: Model weights + KV cache × max_concurrent_requests + ~2GB overhead. Use the calculator below to estimate.
- Bandwidth matters: Decode is memory-bandwidth-bound. An A100 (2TB/s) serves tokens ~3x faster than an A10G (600GB/s) even though compute is only 2x.
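The VRAM budget formula from the bullet above, as a rough sketch. The 2GB-per-request KV figure assumes the Llama 2 7B, 4096-token setup from earlier, and the 2GB overhead is a ballpark — real overhead varies by framework:

```python
def vram_budget_gb(weights_gb, kv_gb_per_request, max_concurrent, overhead_gb=2.0):
    """Weights + KV cache per concurrent request + fixed runtime overhead (ballpark)."""
    return weights_gb + kv_gb_per_request * max_concurrent + overhead_gb

# Llama 2 7B at FP16 (14 GB), 2 GB of KV cache per 4096-token request
for n in [1, 4, 8, 16]:
    print(f"{n:>2} concurrent requests: {vram_budget_gb(14, 2.0, n):.0f} GB")
```

Note that 4 concurrent requests land exactly at the 24GB ceiling from the intro — which is why quantizing weights or paging the KV cache buys you so much concurrency headroom.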
Batching Configuration
- Max batch size: Set based on VRAM, not intuition. Too low = wasted GPU. Too high = OOM.
- Max waiting time: How long a request waits before being scheduled. Lower = better latency, higher = better batching efficiency. 100–500ms is a typical sweet spot.
Monitoring
- Tokens/sec (throughput) — the north star metric
- Queue depth — leading indicator of capacity problems
- P99 latency — tail latency catches stragglers
- GPU utilization — should be 80%+ in steady state
- KV cache utilization — how close you are to memory limits
Autoscaling & Resilience
- Scale trigger: Queue depth exceeding 2x batch size for > 30 seconds
- Graceful degradation: Request queuing with timeouts (30s typical), load shedding at 90% capacity
- Cost optimization: Spot/preemptible instances for overflow traffic, request routing by model size (simple queries → smaller model)
Try It: GPU Memory Budget Calculator
Conclusion
LLM serving is a memory management problem disguised as a compute problem. The evolution from naive sequential serving to vLLM-style PagedAttention is really a story about eliminating waste: wasted padding in static batches, wasted idle slots in batch scheduling, wasted VRAM in contiguous allocation.
The four strategies build on each other:
- Naive sequential: Process one request at a time. Simple, slow, ~15% GPU utilization.
- Static batching: Process N requests together. 5x better, but 40–60% padding waste.
- Continuous batching: Schedule at the iteration level. 2–3x better than static, near-zero waste.
- PagedAttention: Virtual memory for KV cache. 3–5x more concurrent requests, < 4% memory waste.
And speculative decoding cuts per-request latency orthogonally to all of them.
The good news: you don't have to implement any of this yourself. vLLM gives you continuous batching + PagedAttention out of the box. SGLang adds RadixAttention for even smarter prefix caching. TensorRT-LLM optimizes for NVIDIA hardware. The hard engineering is done — your job is to understand why these systems work so you can configure them correctly and debug them when they don't.
References & Further Reading
- Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — The vLLM paper that introduced PagedAttention
- Yu et al. — Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — Introduced iteration-level scheduling (continuous batching)
- Leviathan et al. — Fast Inference from Transformers via Speculative Decoding (ICML 2023) — Formal foundations of speculative decoding
- Zheng et al. — SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024) — RadixAttention and structured generation optimization
- DadOps — KV Cache from Scratch — Deep dive into KV cache memory mechanics
- DadOps — Flash Attention from Scratch — Compute optimization via memory-efficient attention
- DadOps — Speculative Decoding from Scratch — The draft-verify algorithm explained
- DadOps — Quantization from Scratch — Model compression techniques for efficient serving