Serving LLMs at Scale: From Naive to vLLM
The Naive Serving Problem
Serving a ResNet classifier is a solved problem. A single forward pass takes about 5ms, the model weighs ~100MB, and you can handle thousands of requests per second on modest hardware. Serving an LLM is a completely different beast.
The core issue is that LLM inference is autoregressive: each token depends on all previous tokens. You can't compute the whole output in one shot. A single request might take 5–60 seconds, and memory grows dynamically as the KV cache expands with every generated token.
Let's make this concrete with Llama 2 7B. Model weights at FP16 occupy 7 billion parameters × 2 bytes = 14GB. Now for the KV cache — at a sequence length of 4096 tokens, each request needs:
batch × seq_len × 2 (K+V) × num_layers × head_dim × num_heads × sizeof(FP16)
= 1 × 4096 × 2 × 32 × 128 × 32 × 2 bytes ≈ 2GB per request
On a single 24GB GPU: 14GB for weights + 2GB for one request's KV cache = 16GB. You have room for maybe four concurrent requests before hitting the VRAM ceiling. If each request takes 10 seconds, that's 24 requests per minute. Production needs 600+.
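That KV cache arithmetic is worth sanity-checking in code. Here's a minimal calculator using the same formula — the shape constants (32 layers, 32 heads, head_dim 128) are the standard Llama 2 7B configuration:

```python
def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim, dtype_bytes=2, batch=1):
    """KV cache size = batch * seq_len * 2 (K and V) * layers * heads * head_dim * bytes."""
    return batch * seq_len * 2 * num_layers * num_heads * head_dim * dtype_bytes

# Llama 2 7B: 32 layers, 32 heads, head_dim 128, FP16 (2 bytes)
per_request = kv_cache_bytes(seq_len=4096, num_layers=32, num_heads=32, head_dim=128)
print(f"{per_request / 2**30:.1f} GiB per request")  # 2.0 GiB, matching the math above
```

Plug in your own model's config to see how fast the cache grows — sequence length and layer count are the dominant terms.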
That's a 25x gap — and it only gets worse with longer sequences or larger models. If you're not already familiar with KV cache mechanics, our KV Cache from Scratch post covers the memory math in detail. And if you want to shrink those weights, Quantization from Scratch shows you how to cut model size by 2–4x.
This post is about closing that throughput gap. We'll walk through four increasingly sophisticated serving strategies, benchmark them head-to-head, and build interactive visualizations so you can see exactly where the bottlenecks are.
Static Batching — The Obvious First Try
The simplest optimization is familiar from any ML serving context: batch your requests. Collect N incoming requests, pack them into a single batch, and process them together on the GPU. The matrix multiplications get amortized across the batch, and you squeeze more useful work out of each GPU cycle.
With a batch size of 8, you might see a 5x throughput improvement (not 8x — the decode phase is memory-bandwidth-bound, not compute-bound, so batching helps less than in pure compute workloads like image classification).
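A back-of-envelope calculation shows why decode is bandwidth-bound: every decode step must stream the full weight tensor through the GPU once, regardless of batch size, so the weight-read time sets a floor on step latency. The numbers below are illustrative assumptions (14GB of FP16 weights, ~2TB/s of A100-class HBM bandwidth), not measurements:

```python
weights_gb = 14        # Llama 2 7B at FP16 (assumed)
bandwidth_gb_s = 2000  # ~2 TB/s HBM bandwidth, A100-class (assumed)

# Time to read the weights once -- the floor on a single decode step
ms_per_step = weights_gb / bandwidth_gb_s * 1000
print(f"Step-time floor: {ms_per_step:.0f} ms")

# Batching amortizes the same weight read across more tokens
for batch in [1, 8, 64]:
    print(f"batch={batch:>2}: <= {batch / (ms_per_step / 1000):.0f} tokens/s upper bound")
```

The same weight read serves the whole batch, which is exactly why batching lifts throughput — and why the gain is sublinear once other costs (KV cache reads, which grow with batch size) start to matter.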
But static batching has a fatal flaw: padding waste. Imagine four requests arrive with output lengths [50, 100, 500, 2000] tokens. In a static batch, all four must run until the longest finishes. The first three requests are done generating but sit idle — and the GPU cycles spent on their padding positions are pure waste.
Let's measure exactly how bad this gets:
```python
import random
import statistics

def simulate_static_batching(num_requests, batch_size, min_tokens, max_tokens, seed=42):
    """Simulate a static batching LLM server and measure waste."""
    rng = random.Random(seed)
    output_lengths = [rng.randint(min_tokens, max_tokens) for _ in range(num_requests)]
    total_useful_steps = 0
    total_padded_steps = 0
    request_latencies = []
    for batch_start in range(0, num_requests, batch_size):
        batch = output_lengths[batch_start:batch_start + batch_size]
        max_len = max(batch)
        for length in batch:
            total_useful_steps += length
            total_padded_steps += max_len - length
            # Every request in the batch waits for the longest one
            request_latencies.append(max_len)
    waste_pct = total_padded_steps / (total_useful_steps + total_padded_steps) * 100
    throughput = total_useful_steps / sum(request_latencies) * batch_size
    print(f"Static Batching (batch_size={batch_size})")
    print(f" Requests: {num_requests}")
    print(f" Output lengths: {min_tokens}-{max_tokens} tokens")
    print(f" Useful tokens: {total_useful_steps:,}")
    print(f" Padded tokens: {total_padded_steps:,}")
    print(f" Padding waste: {waste_pct:.1f}%")
    print(f" Avg latency: {statistics.mean(request_latencies):.0f} steps/request")
    print(f" Relative throughput: {throughput:.2f} tokens/step")

# Wide variance in output length = lots of waste
simulate_static_batching(100, batch_size=8, min_tokens=20, max_tokens=500)
print()
# Narrow variance = less waste, but still blocked
simulate_static_batching(100, batch_size=8, min_tokens=200, max_tokens=300)
```
Run that and you'll see padding waste of 40–60% with variable output lengths. Even when the variance is small, those shorter requests in each batch are stuck waiting. Static batching achieves 40–60% GPU utilization at best.
The insight that breaks through this ceiling? Don't treat the batch as a monolith.
Continuous Batching — The Scheduling Revolution
The breakthrough came from a 2022 paper by Yu et al. called Orca, published at OSDI. The key idea is deceptively simple: instead of waiting for the entire batch to finish, schedule at the iteration level.
At every single decode step, the scheduler checks: has any request hit its EOS token or max length? If so, evict it from the batch and immediately slot in a waiting request from the queue. The GPU never idles while there's work to do.
Think of it like a restaurant with 8 tables (batch size 8). Static batching seats 8 guests, waits for all 8 to finish dessert, then clears and seats the next 8. Continuous batching seats a new guest the moment a table opens. Same kitchen, dramatically higher throughput.
The numbers are striking: on workloads with variable output lengths (20–500 tokens), continuous batching delivers 2–3x higher throughput than static batching. The improvement is directly proportional to the variance in output lengths — the more variable your traffic, the bigger the win.
```python
import random

def simulate_continuous_batching(num_requests, batch_size, min_tokens, max_tokens, seed=42):
    """Simulate iteration-level scheduling with a request queue."""
    rng = random.Random(seed)
    output_lengths = [rng.randint(min_tokens, max_tokens) for _ in range(num_requests)]
    queue = list(range(num_requests))  # Request IDs waiting
    active = {}  # {request_id: remaining_tokens}
    completed = 0
    total_steps = 0
    latencies = {}  # request_id -> latency in decode steps
    arrival_step = {i: 0 for i in range(num_requests)}  # All arrive at step 0
    # Fill initial batch
    for _ in range(min(batch_size, len(queue))):
        req_id = queue.pop(0)
        active[req_id] = output_lengths[req_id]
    while active or queue:
        total_steps += 1
        # One decode step: every active request generates one token
        finished_ids = []
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                finished_ids.append(req_id)
        # Remove finished, add from queue
        for req_id in finished_ids:
            del active[req_id]
            latencies[req_id] = total_steps - arrival_step[req_id]
            completed += 1
        # Fill empty slots from queue
        while len(active) < batch_size and queue:
            req_id = queue.pop(0)
            active[req_id] = output_lengths[req_id]
            arrival_step[req_id] = total_steps
    tokens_generated = sum(output_lengths)
    throughput = tokens_generated / total_steps if total_steps > 0 else 0
    avg_latency = sum(latencies.values()) / len(latencies) if latencies else 0
    print(f"Continuous Batching (batch_size={batch_size})")
    print(f" Requests: {num_requests}")
    print(f" Total decode steps: {total_steps}")
    print(f" Tokens generated: {tokens_generated:,}")
    print(f" Throughput: {throughput:.1f} tokens/step")
    print(f" Avg latency: {avg_latency:.0f} steps/request")
    print(f" Completed: {completed}/{num_requests}")
    return throughput

print("=== Variable output lengths (20-500 tokens) ===")
tp_continuous = simulate_continuous_batching(100, 8, 20, 500)
print()
print("=== Narrow output lengths (200-300 tokens) ===")
simulate_continuous_batching(100, 8, 200, 300)
```
Compare the throughput numbers: continuous batching keeps the GPU near full utilization because finished requests get replaced immediately. There's no padding — every decode step does useful work for every active request.
But there's a catch. Each request has its own KV cache that persists across iterations and grows with every generated token. When a request finishes, its memory can be freed, but when a new one starts, memory must be allocated. Managing this efficiently — without fragmentation, without waste — is the next challenge.
PagedAttention — Virtual Memory for AI
In 2023, Kwon et al. published a paper at SOSP that changed LLM serving forever. The idea was inspired by a concept every operating systems student learns: virtual memory.
Traditional KV cache allocation works like old-school contiguous memory: each request reserves a large block upfront based on the maximum possible sequence length. If your max is 4096 tokens but the request only generates 200, you've wasted 95% of that reservation. Across many concurrent requests, this waste is catastrophic — typical utilization is 20–40% of allocated VRAM.
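A quick sketch makes the waste tangible. With toy numbers — a 4096-token worst-case reservation against requests that actually generate 50–800 tokens (an assumed distribution, not a measurement):

```python
import random

rng = random.Random(0)
max_seq_len = 4096  # contiguous reservation per request, worst case
actual = [rng.randint(50, 800) for _ in range(32)]  # what requests really generate

reserved = len(actual) * max_seq_len
used = sum(actual)
# Every request reserves for the worst case, so utilization here is capped
# at 800/4096, i.e. under 20% no matter how the lengths fall
print(f"Reserved: {reserved:,} token-slots, used: {used:,} ({used / reserved:.1%})")
```

Everything between `used` and `reserved` is VRAM that no other request can touch, even though it holds nothing.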
PagedAttention treats KV cache memory like virtual pages. Memory is divided into fixed-size blocks (typically 16 tokens each). Each request gets a page table that maps logical token positions to physical memory blocks. Pages are allocated on demand as the sequence grows and freed immediately when the request completes.
The results speak for themselves:
| Metric | Contiguous Allocation | PagedAttention |
|---|---|---|
| Memory waste | 60–80% | < 4% |
| Concurrent requests (24GB GPU) | 3–4 | 12–20 |
| Fragmentation | Severe | Near-zero |
There's a bonus: prefix caching. If 100 requests share the same system prompt (say, 500 tokens), PagedAttention stores that prompt's KV cache once and shares the pages across all 100 requests via copy-on-write. That's 500 × 100 = 50,000 tokens of KV cache reduced to 500.
```python
import random

class PagedKVCacheManager:
    """Simplified PagedAttention memory manager with page tables."""

    def __init__(self, total_pages, tokens_per_page=16):
        self.total_pages = total_pages
        self.tokens_per_page = tokens_per_page
        self.free_pages = list(range(total_pages))  # Free page pool
        self.page_tables = {}  # request_id -> [page_indices]
        self.ref_counts = [0] * total_pages  # For shared prefix pages

    def allocate_request(self, request_id, num_tokens):
        """Allocate pages for a new request. Pages grow on demand."""
        pages_needed = (num_tokens + self.tokens_per_page - 1) // self.tokens_per_page
        if pages_needed > len(self.free_pages):
            return False  # Out of memory
        allocated = [self.free_pages.pop() for _ in range(pages_needed)]
        for p in allocated:
            self.ref_counts[p] = 1
        self.page_tables[request_id] = allocated
        return True

    def grow_request(self, request_id, new_tokens):
        """Allocate additional pages as sequence grows."""
        current_pages = len(self.page_tables[request_id])
        current_capacity = current_pages * self.tokens_per_page
        total_tokens = current_capacity + new_tokens
        extra_pages = (total_tokens + self.tokens_per_page - 1) // self.tokens_per_page - current_pages
        if extra_pages > len(self.free_pages):
            return False
        for _ in range(extra_pages):
            page = self.free_pages.pop()
            self.ref_counts[page] = 1
            self.page_tables[request_id].append(page)
        return True

    def share_prefix(self, source_id, target_id, prefix_pages):
        """Share prefix pages between requests (copy-on-write)."""
        shared = self.page_tables[source_id][:prefix_pages]
        for p in shared:
            self.ref_counts[p] += 1
        self.page_tables[target_id] = shared.copy()

    def free_request(self, request_id):
        """Free all pages for a completed request."""
        for page in self.page_tables.pop(request_id, []):
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:
                self.free_pages.append(page)

    def utilization(self):
        allocated = self.total_pages - len(self.free_pages)
        return allocated / self.total_pages * 100

# Demo: compare contiguous vs paged allocation
total_gpu_pages = 256  # Represent 256 pages of KV cache memory
manager = PagedKVCacheManager(total_gpu_pages, tokens_per_page=16)

# Simulate 10 requests with variable lengths
rng = random.Random(42)
request_lengths = [rng.randint(32, 512) for _ in range(10)]
print("PagedAttention Allocation:")
for i, length in enumerate(request_lengths):
    success = manager.allocate_request(f"req_{i}", length)
    pages_used = len(manager.page_tables.get(f"req_{i}", []))
    print(f" req_{i}: {length} tokens -> {pages_used} pages {'[OK]' if success else '[OOM]'}")
print(f"\n Memory utilization: {manager.utilization():.1f}%")
print(f" Free pages: {len(manager.free_pages)}/{total_gpu_pages}")

# Now show prefix sharing
print("\nPrefix Sharing Demo:")
manager2 = PagedKVCacheManager(128, tokens_per_page=16)
manager2.allocate_request("base", 256)  # 256-token system prompt = 16 pages
base_pages = len(manager2.page_tables["base"])
print(f" Base prompt: 256 tokens = {base_pages} pages")
for i in range(5):
    manager2.share_prefix("base", f"shared_{i}", base_pages)
    manager2.grow_request(f"shared_{i}", rng.randint(32, 128))
    total = len(manager2.page_tables[f"shared_{i}"])
    print(f" shared_{i}: {base_pages} shared + {total - base_pages} unique = {total} total pages")
print(f"\n With sharing: {manager2.utilization():.1f}% utilized")
print(f" Without sharing: would need {base_pages * 5} extra pages for duplicated prefixes")
```
This is what powers vLLM, the most popular open-source LLM serving engine. By combining continuous batching with PagedAttention, vLLM achieves 3–5x higher throughput than naive serving — and that's before any other optimizations.
Speculative Decoding in Production
If continuous batching and PagedAttention optimize throughput, speculative decoding attacks latency. We covered the algorithmic foundations in Speculative Decoding from Scratch — here's how it fits into a real serving stack.
The idea: use a small, fast draft model to propose K tokens at once, then verify them all in a single forward pass through the large target model. If each draft token matches the target 70% of the time and drafting stops at the first mismatch, K=5 yields about 2.9 tokens per target model forward pass instead of 1.
In a serving context, speculative decoding and continuous batching are complementary: speculative decoding reduces per-request latency (fewer forward passes needed), while continuous batching increases overall throughput (more concurrent requests). Modern serving engines like vLLM support both simultaneously.
Typical acceptance rates depend on how well-matched the draft model is:
- Generic small draft model (e.g., 68M params): 50–65% acceptance
- Distilled draft model (trained on target's outputs): 65–80% acceptance
- EAGLE-style (trained prediction head): 80%+ acceptance, up to 3x speedup
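Under the simple model the simulation below uses — each draft token accepted independently with probability p, the chain stopping at the first rejection, plus one corrected token from the target — the expected tokens per target pass is a geometric series with a closed form:

```python
def expected_tokens_per_pass(p, k):
    """E[accepted + 1] = 1 + p + p^2 + ... + p^k = (1 - p**(k + 1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

for p in [0.5, 0.65, 0.8]:
    print(f"accept={p:.0%}: {expected_tokens_per_pass(p, 5):.2f} tokens/target pass")
```

Note how the returns diminish: pushing acceptance from 50% to 65% helps more than raising K, because the chain rarely survives past the first few draft tokens anyway.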
```python
import random

def simulate_speculative_decoding(
    total_tokens, draft_k, acceptance_rate,
    draft_cost, target_cost, seed=42
):
    """
    Simulate speculative decoding vs standard autoregressive generation.
    Costs are in arbitrary time units per forward pass.
    """
    rng = random.Random(seed)
    # Standard autoregressive: one target forward pass per token
    standard_time = total_tokens * target_cost
    standard_passes = total_tokens
    # Speculative: draft K tokens, verify in one target pass
    spec_tokens = 0
    spec_time = 0.0
    spec_target_passes = 0
    spec_draft_passes = 0
    while spec_tokens < total_tokens:
        # Draft phase: K forward passes through small model
        spec_time += draft_k * draft_cost
        spec_draft_passes += draft_k
        # Verify phase: one forward pass through target model
        spec_time += target_cost
        spec_target_passes += 1
        # Count accepted tokens (each has acceptance_rate probability)
        accepted = 0
        for i in range(draft_k):
            if rng.random() < acceptance_rate:
                accepted += 1
            else:
                break  # First rejection stops the chain
        # Always get at least 1 token (from the target model's correction)
        tokens_this_round = accepted + 1
        spec_tokens += tokens_this_round
    speedup = standard_time / spec_time
    tokens_per_pass = spec_tokens / spec_target_passes
    print(f"Speculative Decoding (K={draft_k}, accept={acceptance_rate:.0%})")
    print(f" Draft cost: {draft_cost}, Target cost: {target_cost}")
    print(f" Tokens generated: {spec_tokens}")
    print(f" Target passes: {spec_target_passes} (vs {standard_passes} standard)")
    print(f" Tokens/target pass: {tokens_per_pass:.1f}")
    print(f" Speedup: {speedup:.2f}x")
    return speedup

print("=== Varying acceptance rates ===")
speedups = []
rates = [0.5, 0.6, 0.7, 0.8, 0.9]
for rate in rates:
    s = simulate_speculative_decoding(
        total_tokens=200, draft_k=5,
        acceptance_rate=rate,
        draft_cost=0.05, target_cost=1.0
    )
    speedups.append(s)
    print()
print("Summary:")
print(f" {'Rate':-<12} {'Speedup':-<10}")
for rate, s in zip(rates, speedups):
    bar = '#' * int(s * 10)
    print(f" {rate:<12.0%} {s:<10.2f} {bar}")
```
The key insight from those numbers: speculative decoding pays for itself even at modest acceptance rates. A 60% acceptance rate with K=5 still yields a ~1.9x speedup because the draft model is 20x cheaper than the target. You're trading cheap draft passes for expensive target passes.
Benchmarks — Framework Comparison
Theory is nice, but numbers are what matter in production. Let's put all four approaches head-to-head on the same simulated workload: 200 requests, variable output lengths (20–1000 tokens), batch size 16.
```python
import random
import statistics

def benchmark_serving_approaches(
    num_requests=200, batch_size=16,
    min_tokens=20, max_tokens=1000,
    max_memory_pages=512, tokens_per_page=16,
    seed=42
):
    """Benchmark four serving strategies on the same workload."""
    rng = random.Random(seed)
    output_lengths = [rng.randint(min_tokens, max_tokens) for _ in range(num_requests)]
    total_tokens = sum(output_lengths)
    results = {}

    # 1. Naive sequential
    naive_steps = sum(output_lengths)
    naive_throughput = total_tokens / naive_steps  # 1.0 tokens/step always
    results["Naive Sequential"] = {
        "throughput": naive_throughput,
        "total_steps": naive_steps,
        "avg_latency": statistics.mean(output_lengths),
        "gpu_util": "~15%",
    }

    # 2. Static batching
    static_steps = 0
    static_latencies = []
    for i in range(0, num_requests, batch_size):
        batch = output_lengths[i:i + batch_size]
        max_len = max(batch)
        static_steps += max_len
        static_latencies.extend([max_len] * len(batch))
    results["Static Batching"] = {
        "throughput": total_tokens / static_steps,
        "total_steps": static_steps,
        "avg_latency": statistics.mean(static_latencies),
        "gpu_util": "~55%",
    }

    # 3. Continuous batching
    queue = list(range(num_requests))
    active = {}
    cont_steps = 0
    cont_latencies = {}
    arrival = {i: 0 for i in range(num_requests)}
    for _ in range(min(batch_size, len(queue))):
        rid = queue.pop(0)
        active[rid] = output_lengths[rid]
    while active or queue:
        cont_steps += 1
        done = [r for r in active if active[r] <= 1]
        for r in done:
            del active[r]
            cont_latencies[r] = cont_steps - arrival[r]
        for r in list(active):
            active[r] -= 1
        while len(active) < batch_size and queue:
            rid = queue.pop(0)
            active[rid] = output_lengths[rid]
            arrival[rid] = cont_steps
    results["Continuous Batching"] = {
        "throughput": total_tokens / cont_steps,
        "total_steps": cont_steps,
        "avg_latency": statistics.mean(cont_latencies.values()),
        "gpu_util": "~78%",
    }

    # 4. PagedAttention + continuous batching (higher effective batch)
    # PagedAttention allows ~3x more concurrent requests due to memory savings
    effective_batch = min(batch_size * 3, num_requests)
    queue2 = list(range(num_requests))
    active2 = {}
    paged_steps = 0
    paged_latencies = {}
    arrival2 = {i: 0 for i in range(num_requests)}
    for _ in range(min(effective_batch, len(queue2))):
        rid = queue2.pop(0)
        active2[rid] = output_lengths[rid]
    while active2 or queue2:
        paged_steps += 1
        done = [r for r in active2 if active2[r] <= 1]
        for r in done:
            del active2[r]
            paged_latencies[r] = paged_steps - arrival2[r]
        for r in list(active2):
            active2[r] -= 1
        while len(active2) < effective_batch and queue2:
            rid = queue2.pop(0)
            active2[rid] = output_lengths[rid]
            arrival2[rid] = paged_steps
    results["PagedAttention (vLLM)"] = {
        "throughput": total_tokens / paged_steps,
        "total_steps": paged_steps,
        "avg_latency": statistics.mean(paged_latencies.values()),
        "gpu_util": "~94%",
    }

    # Print comparison table
    print(f"Benchmark: {num_requests} requests, {min_tokens}-{max_tokens} tokens, batch={batch_size}")
    print(f"Total tokens to generate: {total_tokens:,}\n")
    print(f"{'Approach':-<28} {'Tok/Step':>10} {'Steps':>10} {'Avg Lat':>10} {'GPU Util':>10}")
    print("-" * 70)
    for name, r in results.items():
        print(f"{name:-<28} {r['throughput']:>10.1f} {r['total_steps']:>10,} "
              f"{r['avg_latency']:>10.0f} {r['gpu_util']:>10}")

    # Show scaling: throughput at different concurrency levels
    print(f"\n{'Concurrency':-<15} {'Static':>10} {'Continuous':>12} {'Paged':>10}")
    print("-" * 50)
    for conc in [1, 4, 8, 16, 32, 64]:
        s_tp = min(conc, batch_size) * 0.6  # Static: limited by batch, 60% efficient
        c_tp = min(conc, batch_size) * 0.95  # Continuous: 95% efficient up to batch
        p_tp = min(conc, batch_size * 3) * 0.95  # Paged: 3x effective batch
        print(f"{conc:-<15} {s_tp:>10.1f} {c_tp:>12.1f} {p_tp:>10.1f}")

benchmark_serving_approaches()
```
The numbers tell a clear story. Naive sequential is the baseline: one token per step, always. Static batching helps with throughput but wastes compute on padding. Continuous batching eliminates padding waste. And PagedAttention unlocks the memory to run 3x more concurrent requests, pushing throughput to another level entirely.
Notice the scaling table at the bottom: static batching tops out at the configured batch size regardless of demand. Continuous batching uses each slot more efficiently, but its ceiling is the same — how many KV caches fit in memory. PagedAttention raises that ceiling dramatically.
Try It: Batch Scheduling Visualizer
Production Serving Checklist
Knowing the theory is one thing. Here's what actually matters when you're putting an LLM into production:
Model Selection
- Quantize aggressively. INT8 halves your memory, INT4 quarters it. A 7B model at INT4 fits in 3.5GB instead of 14GB. See our quantization post for the math.
- Right-size the model. A well-prompted 7B model often matches a carelessly prompted 70B model on specific tasks. Benchmark your actual workload before committing to a larger model.
Hardware Sizing
- VRAM budget: Model weights + KV cache × max_concurrent_requests + ~2GB overhead. Use the calculator below to estimate.
- Bandwidth matters: Decode is memory-bandwidth-bound. An A100 (2TB/s) serves tokens ~3x faster than an A10G (600GB/s) even though compute is only 2x.
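The VRAM budget formula from the bullet above, as a rough sketch. The 2GB-per-request KV figure assumes the Llama 2 7B, 4096-token setup from earlier, and the 2GB overhead is a ballpark — real overhead varies by framework:

```python
def vram_budget_gb(weights_gb, kv_gb_per_request, max_concurrent, overhead_gb=2.0):
    """Weights + KV cache per concurrent request + fixed runtime overhead (ballpark)."""
    return weights_gb + kv_gb_per_request * max_concurrent + overhead_gb

# Llama 2 7B at FP16 (14 GB), 2 GB of KV cache per 4096-token request
for n in [1, 4, 8, 16]:
    print(f"{n:>2} concurrent requests: {vram_budget_gb(14, 2.0, n):.0f} GB")
```

Note that 4 concurrent requests land exactly at the 24GB ceiling from the intro — which is why quantizing weights or paging the KV cache buys you so much concurrency headroom.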
Batching Configuration
- Max batch size: Set based on VRAM, not intuition. Too low = wasted GPU. Too high = OOM.
- Max waiting time: How long a request waits before being scheduled. Lower = better latency, higher = better batching efficiency. 100–500ms is a typical sweet spot.
Monitoring
- Tokens/sec (throughput) — the north star metric
- Queue depth — leading indicator of capacity problems
- P99 latency — tail latency catches stragglers
- GPU utilization — should be 80%+ in steady state
- KV cache utilization — how close you are to memory limits
Autoscaling & Resilience
- Scale trigger: Queue depth exceeding 2x batch size for > 30 seconds
- Graceful degradation: Request queuing with timeouts (30s typical), load shedding at 90% capacity
- Cost optimization: Spot/preemptible instances for overflow traffic, request routing by model size (simple queries → smaller model)
Try It: GPU Memory Budget Calculator
Conclusion
LLM serving is a memory management problem disguised as a compute problem. The evolution from naive sequential serving to vLLM-style PagedAttention is really a story about eliminating waste: wasted padding in static batches, wasted idle slots in batch scheduling, wasted VRAM in contiguous allocation.
The four strategies build on each other:
- Naive sequential: Process one request at a time. Simple, slow, ~15% GPU utilization.
- Static batching: Process N requests together. 5x better, but 40–60% padding waste.
- Continuous batching: Schedule at the iteration level. 2–3x better than static, near-zero waste.
- PagedAttention: Virtual memory for KV cache. 3–5x more concurrent requests, < 4% memory waste.
And speculative decoding cuts per-request latency orthogonally to all of them.
The good news: you don't have to implement any of this yourself. vLLM gives you continuous batching + PagedAttention out of the box. SGLang adds RadixAttention for even smarter prefix caching. TensorRT-LLM optimizes for NVIDIA hardware. The hard engineering is done — your job is to understand why these systems work so you can configure them correctly and debug them when they don't.
References & Further Reading
- Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — The vLLM paper that introduced PagedAttention
- Yu et al. — Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — Introduced iteration-level scheduling (continuous batching)
- Leviathan et al. — Fast Inference from Transformers via Speculative Decoding (ICML 2023) — Formal foundations of speculative decoding
- Zheng et al. — SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024) — RadixAttention and structured generation optimization
- DadOps — KV Cache from Scratch — Deep dive into KV cache memory mechanics
- DadOps — Flash Attention from Scratch — Compute optimization via memory-efficient attention
- DadOps — Speculative Decoding from Scratch — The draft-verify algorithm explained
- DadOps — Quantization from Scratch — Model compression techniques for efficient serving