
GPU Memory Benchmarks: Will This Model Fit?

The GPU Memory Budget — Where Does VRAM Go?

You have a 24GB GPU and want to fine-tune a 7B parameter model. Quick math: 7 billion × 4 bytes = 28GB. It doesn't fit. But in half-precision: 7 billion × 2 bytes = 14GB. It fits! Except… during training, the optimizer stores two additional copies of every parameter (momentum and variance in Adam), gradients consume another copy's worth of memory, and activations scale with your batch size. Suddenly you need 80GB+. Where did all that memory go?

This is the question every ML practitioner asks before every experiment, and getting the answer wrong means wasted hours staring at CUDA out of memory errors. In this post, we'll profile GPU memory usage across model sizes, precision levels, batch sizes, and training stages. By the end, you'll have concrete numbers and practical formulas to predict memory requirements before hitting OOM.

GPU memory breaks down into four main consumers:

  1. Model parameters — the weights themselves. A 7B model in FP16 takes 14GB. Straightforward.
  2. Optimizer states — Adam stores two buffers per parameter (first and second moment estimates), both in FP32 even when the model is FP16. That's 2 × 7B × 4 bytes = 56GB of optimizer state for our 7B model. With mixed-precision training, you also keep an FP32 master copy of the weights, adding another 28GB.
  3. Gradients — same shape as parameters. In FP16 training: 14GB.
  4. Activations — intermediate tensors saved during the forward pass so the backward pass can compute gradients. This is the sneaky one: it scales with batch_size × sequence_length × hidden_dimension, and for large batches it can dwarf everything else.

The master formula:

Total VRAM = Parameters + Optimizer States + Gradients + Activations + KV Cache (inference)

Let's put real numbers on each of these. (If you want to understand why we use different precisions, see Quantization from Scratch. For parameter-efficient alternatives to full fine-tuning, see LoRA from Scratch.)
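As a sanity check, the intro's numbers fall straight out of this formula. A minimal back-of-envelope sketch in pure Python, assuming FP16 weights and the full mixed-precision Adam setup (FP32 master weights plus two moment buffers, covered in detail below):

```python
def training_budget_gb(param_billions, param_bytes=2, fp32_opt_copies=3,
                       activation_gb=0.0):
    """Back-of-envelope training memory: params + gradients in training
    precision, plus Adam's FP32 buffers (master weights, m, v)."""
    params = param_billions * 1e9
    param_gb = params * param_bytes / 1e9          # model weights
    grad_gb = params * param_bytes / 1e9           # gradients, same shape
    opt_gb = params * 4 * fp32_opt_copies / 1e9    # FP32 optimizer state
    return param_gb + grad_gb + opt_gb + activation_gb

# 7B in FP16 with mixed-precision Adam: 14 + 14 + 84 = 112 GB
# before a single activation is stored
print(training_budget_gb(7))  # → 112.0
```

Activations come on top of this, which is why even an 80 GB card cannot hold full FP16 fine-tuning of a 7B model with Adam.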

Benchmarking Setup — Measuring What You Can't See

Before we can profile anything, we need a reliable way to measure GPU memory at each stage of a model's lifecycle. PyTorch exposes several CUDA memory APIs, but they measure different things. Here's a profiling harness that captures the full picture:

import torch
import contextlib

@contextlib.contextmanager
def gpu_memory_tracker(stage_name, log):
    """Track GPU memory before, during, and after a stage."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    mem_before = torch.cuda.memory_allocated()

    yield  # Run the profiled code

    torch.cuda.synchronize()
    mem_after = torch.cuda.memory_allocated()
    mem_peak = torch.cuda.max_memory_allocated()

    log.append({
        "stage": stage_name,
        "before_mb": mem_before / 1e6,
        "after_mb": mem_after / 1e6,
        "peak_mb": mem_peak / 1e6,
        "delta_mb": (mem_after - mem_before) / 1e6,
    })


def profile_model(model_cls, input_fn, optimizer_cls=None):
    """Profile memory across model lifecycle stages."""
    log = []

    with gpu_memory_tracker("Model Loading", log):
        model = model_cls().cuda()

    x = input_fn()  # Create input on GPU

    with gpu_memory_tracker("Forward Pass", log):
        output = model(x)
        loss = output.sum()

    if optimizer_cls:
        optimizer = optimizer_cls(model.parameters())
        with gpu_memory_tracker("Backward Pass", log):
            loss.backward()

        with gpu_memory_tracker("Optimizer Step", log):
            optimizer.step()
            optimizer.zero_grad()

    # Print breakdown table
    print(f"{'Stage':<20} {'Before':>10} {'After':>10} {'Peak':>10} {'Delta':>10}")
    print("-" * 62)
    for entry in log:
        print(f"{entry['stage']:<20} {entry['before_mb']:>9.1f}M "
              f"{entry['after_mb']:>9.1f}M {entry['peak_mb']:>9.1f}M "
              f"{entry['delta_mb']:>+9.1f}M")
    return log

A critical pitfall: torch.cuda.memory_allocated() only tracks memory managed by PyTorch's caching allocator. CUDA kernels, cuDNN workspace buffers, and memory fragmentation can add 10–20% overhead. Always cross-check with nvidia-smi for ground truth. The gap between memory_allocated and memory_reserved shows you how much the caching allocator is holding in reserve — memory that's allocated from CUDA but not currently used by any tensor.

We use torch.cuda.synchronize() before every measurement because CUDA operations are asynchronous. Without it, you'd be reading stale numbers from before the GPU finished its work.

Inference Memory — Model Size × Precision

Inference is the simpler case. No gradients, no optimizer states — just the model parameters and any runtime buffers. But the relationship between model size and memory isn't always linear, because precision changes the bytes-per-parameter and the KV cache adds a batch-dependent component that can surprise you.

import torch
from transformers import AutoModelForCausalLM

def benchmark_inference_memory(model_name, precisions, batch_sizes, seq_len=512):
    """Measure inference memory across precisions and batch sizes."""
    results = []

    for precision in precisions:
        dtype_map = {
            "FP32": torch.float32,
            "FP16": torch.float16,
            "BF16": torch.bfloat16,
        }
        load_kwargs = {}
        if precision in dtype_map:
            load_kwargs["torch_dtype"] = dtype_map[precision]
        elif precision == "INT8":
            load_kwargs["load_in_8bit"] = True
        elif precision == "INT4":
            load_kwargs["load_in_4bit"] = True

        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

        model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto", **load_kwargs
        )
        model_mem = torch.cuda.memory_allocated() / 1e9

        for bs in batch_sizes:
            torch.cuda.reset_peak_memory_stats()
            input_ids = torch.randint(0, 1000, (bs, seq_len), device="cuda")

            with torch.no_grad():
                _ = model.generate(input_ids, max_new_tokens=1)

            peak_mem = torch.cuda.max_memory_allocated() / 1e9
            results.append({
                "precision": precision,
                "batch_size": bs,
                "model_gb": round(model_mem, 2),
                "peak_gb": round(peak_mem, 2),
                "kv_cache_gb": round(peak_mem - model_mem, 2),
            })
        del model

    return results

Here's what the numbers look like for a 7B-parameter model:

Precision   Model Memory   Peak (bs=1, 512 tokens)   Peak (bs=8, 2048 tokens)
FP32        28.0 GB        29.1 GB                   44.5 GB
FP16        14.0 GB        14.5 GB                   22.2 GB
BF16        14.0 GB        14.5 GB                   22.2 GB
INT8         7.0 GB         7.5 GB                   15.2 GB
INT4         3.5 GB         4.0 GB                   11.7 GB

Notice that at batch_size=1 with short sequences, precision dominates. FP16 halves memory with negligible quality loss. INT8 and INT4 go further with increasingly noticeable quality trade-offs. But at batch_size=8 with 2048 tokens, the KV cache eats 8+ GB regardless of model precision — the cache stores keys and values for every layer and every attention head at the sequence's full precision. (For a deep dive on this, see KV Cache from Scratch.)

The KV cache formula for a standard transformer:

KV cache = 2 × num_layers × num_heads × head_dim × bytes_per_param × batch_size × seq_length

For our 7B model (32 layers, 32 heads, 128-dim heads, FP16): each token costs 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 0.52 MB. At batch=8 and seq=2048, that's 8 × 2048 × 0.52 MB ≈ 8.6 GB — more than the INT4 model itself.
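That per-token arithmetic generalizes to any config; here is the same formula as a small helper (pure arithmetic, no GPU required):

```python
def kv_cache_gb(num_layers, num_heads, head_dim, bytes_per_param,
                batch_size, seq_length):
    """KV cache size: keys and values (the factor of 2) for every
    layer, head, and token, at the cache's precision."""
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_param
    return per_token * batch_size * seq_length / 1e9

# 7B-class model (32 layers, 32 heads, head_dim 128) in FP16:
print(kv_cache_gb(32, 32, 128, 2, batch_size=8, seq_length=2048))
# → 8.589934592, i.e. ~8.6 GB
```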

Training Memory — The Real Budget

Training is where GPU memory gets serious. The forward pass stores activations for every layer (needed by the backward pass to compute gradients), the backward pass allocates gradient tensors, and the optimizer step maintains its own state buffers. Let's measure each stage in isolation:

import torch

def profile_training_stages(model, input_ids, labels):
    """Break down memory consumption at each training stage."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    stages = {}

    # Stage 1: Model already loaded — measure baseline
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    stages["model_params"] = torch.cuda.memory_allocated() / 1e9

    # Stage 2: Forward pass — activations accumulate
    torch.cuda.reset_peak_memory_stats()
    outputs = model(input_ids, labels=labels)
    loss = outputs.loss
    torch.cuda.synchronize()
    stages["after_forward"] = torch.cuda.memory_allocated() / 1e9
    stages["forward_peak"] = torch.cuda.max_memory_allocated() / 1e9

    # Stage 3: Backward pass — gradients allocated
    torch.cuda.reset_peak_memory_stats()
    loss.backward()
    torch.cuda.synchronize()
    stages["after_backward"] = torch.cuda.memory_allocated() / 1e9
    stages["backward_peak"] = torch.cuda.max_memory_allocated() / 1e9

    # Stage 4: Optimizer step — optimizer states created on first call
    torch.cuda.reset_peak_memory_stats()
    optimizer.step()
    torch.cuda.synchronize()
    stages["after_optimizer"] = torch.cuda.memory_allocated() / 1e9
    stages["optimizer_peak"] = torch.cuda.max_memory_allocated() / 1e9

    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    stages["after_cleanup"] = torch.cuda.memory_allocated() / 1e9

    print("Training Memory Waterfall (7B model, FP16, batch=4, seq=512):")
    print(f"  Model parameters:     {stages['model_params']:.1f} GB")
    print(f"  After forward pass:   {stages['after_forward']:.1f} GB  "
          f"(+{stages['after_forward'] - stages['model_params']:.1f} GB activations)")
    print(f"  After backward pass:  {stages['after_backward']:.1f} GB  "
          f"(+{stages['after_backward'] - stages['after_forward']:.1f} GB gradients)")
    print(f"  After optimizer step: {stages['after_optimizer']:.1f} GB  "
          f"(+{stages['after_optimizer'] - stages['after_backward']:.1f} GB opt states)")
    print(f"  After cleanup:        {stages['after_cleanup']:.1f} GB")
    print(f"  Peak during backward: {stages['backward_peak']:.1f} GB")
    return stages

For a 7B model in FP16 with Adam and batch_size=4, seq_len=512, the waterfall looks like this:

Stage             Memory    Component
Model loaded      14.0 GB   Parameters (FP16)
After forward     22.4 GB   +8.4 GB activations
After backward    28.0 GB   +14.0 GB gradients, −8.4 GB activations freed
After optimizer   84.0 GB   +56.0 GB Adam states (m & v in FP32)
After zero_grad   70.0 GB   Gradients freed

The optimizer step is the gut punch. Adam stores two FP32 buffers per parameter — the first moment (m) and second moment (v). That's 2 × 7B × 4 bytes = 56 GB of optimizer state for our 7B model. In full mixed-precision setups (like Hugging Face Trainer), you'd also keep FP32 master weights — adding another 28 GB and pushing total optimizer overhead to 84 GB. After the first step, these buffers persist for the entire training run.

This is exactly why LoRA is transformative for memory: by freezing the base model and only training small adapter matrices, you reduce trainable parameters by 100–1000×. The optimizer only needs states for the adapters, dropping 56+ GB of optimizer overhead to well under 1 GB.
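To put a rough number on that, here's an estimate of the Adam-state footprint for LoRA adapters. The rank, the count of adapted matrices per layer, and the factor shapes are illustrative assumptions, not measured values:

```python
def lora_optimizer_state_gb(hidden_dim, num_layers, rank=16,
                            adapted_per_layer=4, fp32_copies=3):
    """Adam state for LoRA: each adapted weight matrix gets two
    low-rank factors of roughly (hidden_dim, rank) each, and Adam
    keeps fp32_copies FP32 buffers per trainable parameter."""
    trainable = num_layers * adapted_per_layer * 2 * hidden_dim * rank
    return trainable * 4 * fp32_copies / 1e9

# 7B-class config: 32 layers, hidden 4096, rank-16 adapters on four
# projection matrices per layer (a hypothetical but typical setup)
print(lora_optimizer_state_gb(4096, 32))
# → ~0.2 GB, versus 56+ GB for full fine-tuning
```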

Activation Checkpointing

Activation checkpointing (also called gradient checkpointing) is the classic time-memory trade-off: instead of storing all intermediate activations during the forward pass, you only save activations at selected "checkpoint" layers and recompute the others during the backward pass. This can reduce activation memory by 5–10× at the cost of roughly 30% slower training. For models where activations are the bottleneck, it's the difference between fitting and OOM. (See Flash Attention from Scratch for another approach to reducing memory in the attention layers.)
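To quantify the trade-off, here's a rough model using the per-sample activation estimate from the batch-size section below (12 × hidden_dim × seq_len × num_layers × bytes). Treating "checkpoint every k-th layer" as keeping 1/k of the activations is a simplification that ignores recomputation buffers:

```python
def activation_gb(hidden_dim, seq_len, num_layers, bytes_per_param,
                  batch_size, checkpoint_every=1):
    """Rough transformer activation memory. checkpoint_every=1 means
    no checkpointing; k means only every k-th layer's activations
    are kept and the rest are recomputed during backward."""
    full = 12 * hidden_dim * seq_len * num_layers * bytes_per_param * batch_size
    return full / checkpoint_every / 1e9

base = activation_gb(4096, 512, 32, 2, batch_size=4)
ckpt = activation_gb(4096, 512, 32, 2, batch_size=4, checkpoint_every=8)
print(f"{base:.1f} GB -> {ckpt:.1f} GB")  # → 6.4 GB -> 0.8 GB
```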

Batch Size, Sequence Length, and the OOM Cliff

Here's something counterintuitive: parameter memory, optimizer states, and gradient memory are all fixed regardless of batch size. A 7B model with Adam uses the same fixed overhead for params + optimizer + gradients — over 80 GB — whether you run batch_size=1 or batch_size=128. The only thing that changes is activations, and they scale linearly with batch_size × sequence_length.

def find_max_batch_size(model, seq_len, gpu_memory_gb, precision="fp16"):
    """Find the maximum batch size that fits in GPU memory."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}[precision]
    param_count = sum(p.numel() for p in model.parameters())

    # Fixed memory: params + gradients
    param_mem = param_count * bytes_per_param
    grad_mem = param_count * bytes_per_param

    # Optimizer states (Adam: FP32 master weights + 2 momentum buffers)
    opt_mem = param_count * 4 * 3  # 3 FP32 copies

    fixed_mem = param_mem + grad_mem + opt_mem
    available_for_activations = (gpu_memory_gb * 1e9) - fixed_mem

    # Activation memory per sample (rough estimate for transformers):
    # ~12 * hidden_dim * seq_len * num_layers * bytes_per_param
    # The factor of 12 accounts for attention matrices, layer norms,
    # feed-forward intermediates, and saved inputs for backward
    hidden_dim = model.config.hidden_size
    num_layers = model.config.num_hidden_layers
    act_per_sample = 12 * hidden_dim * seq_len * num_layers * bytes_per_param

    max_batch = int(available_for_activations / act_per_sample)
    max_batch = max(max_batch, 0)

    print(f"Fixed memory:  {fixed_mem / 1e9:.1f} GB "
          f"(params={param_mem/1e9:.1f}, grads={grad_mem/1e9:.1f}, "
          f"optimizer={opt_mem/1e9:.1f})")
    print(f"Activation/sample: {act_per_sample / 1e6:.0f} MB")
    print(f"Available for activations: {available_for_activations / 1e9:.1f} GB")
    print(f"Max batch size: {max_batch}")

    return max_batch

This creates what I call the OOM cliff. You can run batch_size=4 with headroom to spare, batch_size=8 is tight but works, and batch_size=12 instantly crashes with CUDA out of memory. There's no gradual degradation — it's a hard wall.

The memory curve looks like this: a flat base (parameters + optimizer + gradients) plus a linear slope (activation memory per sample). Where that line crosses your GPU's VRAM ceiling is your maximum batch size.
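A toy sweep makes the cliff concrete. The ~84 GB fixed base matches the 7B FP16 + Adam waterfall above; the 2.1 GB-per-sample activation slope and the 96 GB ceiling are illustrative numbers, not measurements:

```python
def fits(batch_size, vram_gb=96.0, fixed_gb=84.0, act_per_sample_gb=2.1):
    """Total memory is a flat base plus a linear activation term."""
    return fixed_gb + act_per_sample_gb * batch_size <= vram_gb

for bs in (1, 2, 4, 5, 6, 8):
    print(bs, "fits" if fits(bs) else "OOM")
# batch sizes 1-5 fit; 6 crosses the ceiling and 8 is far past it --
# there is no gradual degradation, just a hard wall
```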

Gradient accumulation is the escape hatch. Instead of processing your entire batch in one forward pass, you process micro-batches and accumulate their gradients before the optimizer step:

effective_batch = micro_batch_size × accumulation_steps

Memory scales with micro_batch_size, not effective_batch_size. Want an effective batch of 64 on a GPU that only fits batch=4? Set accumulation_steps=16. You get identical gradient updates with 4× the total compute time — but it fits.
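The claim that accumulation gives identical gradient updates is easy to verify on a toy model. Below, hand-computed gradients for ŷ = w·x under mean squared error show that summing micro-batch gradients scaled by 1/accumulation_steps reproduces the full-batch gradient (pure Python, no framework):

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over a batch, computed by hand."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

full = grad_mse(w, xs, ys)      # one forward/backward over batch of 8

accum, steps = 0.0, 4           # 4 micro-batches of 2
for i in range(steps):
    micro_x, micro_y = xs[2 * i:2 * i + 2], ys[2 * i:2 * i + 2]
    # scale each micro-batch gradient by 1/steps, the same effect as
    # dividing the loss by accumulation_steps before backward()
    accum += grad_mse(w, micro_x, micro_y) / steps

print(abs(full - accum) < 1e-9)  # → True: identical update, 1/4 the memory
```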

GPU VRAM   7B (FP16 + Adam)   13B (FP16 + Adam)   7B (INT8 + LoRA)
24 GB      OOM                OOM                 bs=8
40 GB      OOM                OOM                 bs=32
48 GB      OOM                OOM                 bs=48
80 GB      OOM                OOM                 bs=96+

This table makes clear why LoRA + quantization has become the default for practitioners. Full fine-tuning of even a 7B model with standard Adam won't fit on a single GPU — you need multi-GPU sharding (FSDP/ZeRO) or 8-bit optimizers. Meanwhile, INT8 + LoRA fits on a 24 GB consumer card with room for decent batch sizes.

Multi-GPU Strategies — When One Card Isn't Enough

When your model simply won't fit on one GPU, there are three fundamental strategies for distributing memory across multiple cards. Each trades off memory savings against communication overhead differently:

def estimate_multi_gpu_memory(param_count_b, num_gpus, strategy, precision="fp16"):
    """Estimate per-GPU memory for different parallelism strategies."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}[precision]
    param_mem = param_count_b * 1e9 * bytes_per_param
    grad_mem = param_mem  # Same as params
    opt_mem = param_count_b * 1e9 * 4 * 3  # Adam: FP32 master + m + v

    if strategy == "data_parallel":
        # Each GPU holds a FULL copy of everything, batches are split
        per_gpu_mem = param_mem + grad_mem + opt_mem
        comm = "AllReduce gradients every step"
        speedup = f"~{num_gpus}x throughput (linear scaling)"

    elif strategy == "pipeline_parallel":
        # Model layers split across GPUs
        per_gpu_mem = (param_mem + grad_mem + opt_mem) / num_gpus
        comm = "Activations sent between pipeline stages"
        speedup = "~{0}x capacity, <{0}x throughput (pipeline bubbles)".format(
            num_gpus)

    elif strategy == "fsdp_zero3":
        # Params, gradients, AND optimizer sharded across GPUs
        per_gpu_mem = (param_mem + grad_mem + opt_mem) / num_gpus
        comm = "AllGather params before forward, ReduceScatter gradients"
        speedup = f"~{num_gpus}x capacity, good throughput with overlap"

    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    per_gpu_gb = per_gpu_mem / 1e9
    print(f"Strategy: {strategy}")
    print(f"  Per-GPU memory: {per_gpu_gb:.1f} GB (model/grads/optimizer)")
    print(f"  Communication: {comm}")
    print(f"  Scaling: {speedup}")
    return per_gpu_gb

Here's the concrete comparison for a 13B model (26 GB in FP16):

Using the same accounting as the code above (26 GB params + 26 GB gradients + 156 GB Adam states = 208 GB total):

Strategy            Per-GPU (2 GPUs)          Per-GPU (4 GPUs)   Communication Cost
Data Parallel       208 GB (full copy each)   208 GB             Gradients synced each step
Pipeline Parallel   104 GB                    52 GB              Activations between stages
FSDP / ZeRO-3       104 GB                    52 GB              Params gathered, grads scattered

Data Parallel doesn't save memory at all — it's purely for throughput. Each GPU holds a complete copy. Pipeline Parallel and FSDP both shard the memory, but FSDP is generally preferred because it avoids the "pipeline bubble" problem (GPUs sitting idle while waiting for activations from the previous stage).

The practical decision tree:

  1. Model fits on one GPU? Use single-GPU with quantization (INT8/INT4).
  2. Training fits with LoRA? Use single-GPU with LoRA.
  3. Need more throughput? Data Parallel (simple, linear speedup).
  4. Model doesn't fit at all? FSDP / ZeRO-3 (shards everything).

The Decision Calculator — Will It Fit?

Let's wrap everything into a single function that estimates memory requirements for any configuration. This consolidates the formulas from every section above:

def will_it_fit(
    param_billions,
    precision="fp16",
    training=True,
    optimizer="adam",
    batch_size=1,
    seq_length=2048,
    hidden_dim=4096,
    num_layers=32,
    gpu_vram_gb=24,
):
    """Estimate total GPU memory and whether it fits."""
    bpp = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}[precision]
    params = param_billions * 1e9

    # Model parameters
    param_mem = params * bpp

    # KV cache (inference only — during training, this is part of activations)
    num_heads = hidden_dim // 128  # Standard head_dim = 128
    kv_per_token = 2 * num_layers * num_heads * 128 * bpp
    kv_cache = kv_per_token * batch_size * seq_length if not training else 0

    # Optimizer states
    if training and optimizer == "adam":
        opt_mem = params * 4 * 3  # FP32 master + momentum + variance
    elif training and optimizer == "sgd":
        opt_mem = params * 4      # FP32 master weights only
    else:
        opt_mem = 0

    # Gradients
    grad_mem = params * bpp if training else 0

    # Activations (training only — rough estimate for transformers)
    if training:
        act_mem = 12 * hidden_dim * seq_length * num_layers * bpp * batch_size
    else:
        act_mem = 0

    total = param_mem + opt_mem + grad_mem + act_mem + kv_cache
    total_gb = total / 1e9
    fits = total_gb <= gpu_vram_gb

    components = {
        "Parameters": param_mem / 1e9,
        "Optimizer States": opt_mem / 1e9,
        "Gradients": grad_mem / 1e9,
        "Activations": act_mem / 1e9,
        "KV Cache": kv_cache / 1e9,
    }

    print(f"\n{'='*50}")
    print(f"  {param_billions}B model | {precision.upper()} | "
          f"{'Training' if training else 'Inference'}")
    print(f"  Batch={batch_size}, Seq={seq_length}")
    print(f"{'='*50}")
    for name, gb in components.items():
        if gb > 0:
            bar = "#" * int(gb * 2)
            print(f"  {name:<18} {gb:>7.1f} GB  {bar}")
    print(f"  {'─'*40}")
    print(f"  {'TOTAL':<18} {total_gb:>7.1f} GB")
    print(f"  GPU VRAM:          {gpu_vram_gb:>7.1f} GB")
    print(f"  Status:            {'✓ FITS' if fits else '✗ OOM'}")
    return total_gb, fits

Some quick rules of thumb that hold within ~20% for standard transformer architectures:

  1. Inference: about 2 GB per billion parameters in FP16 (halve it for INT8, quarter it for INT4), plus KV cache for large batches or long sequences.
  2. Full training with Adam and mixed precision: about 16 GB per billion parameters (2 bytes of params + 2 bytes of gradients + 12 bytes of optimizer state), before activations.
  3. Activations are the only batch-dependent term: they scale linearly with batch_size × sequence_length, so reach for gradient accumulation or activation checkpointing when they dominate.
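These rules of thumb reduce to a bytes-per-parameter lookup (the values follow from the formulas earlier in this post):

```python
BYTES_PER_PARAM = {
    ("inference", "fp16"): 2,       # weights only
    ("inference", "int8"): 1,
    ("inference", "int4"): 0.5,
    ("training", "fp16_adam"): 16,  # 2 params + 2 grads + 12 optimizer
}

def rule_of_thumb_gb(param_billions, mode, precision):
    """Quick estimate, excluding activations and KV cache."""
    return param_billions * BYTES_PER_PARAM[(mode, precision)]

print(rule_of_thumb_gb(7, "inference", "fp16"))      # → 14
print(rule_of_thumb_gb(7, "training", "fp16_adam"))  # → 112
```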


References & Further Reading

Related DadOps posts: Quantization from Scratch, LoRA from Scratch, Serving LLMs at Scale, KV Cache from Scratch, Flash Attention from Scratch, Profiling Python AI Code, Python Concurrency for AI