GPU Memory Benchmarks: Will This Model Fit?
The GPU Memory Budget — Where Does VRAM Go?
You have a 24GB GPU and want to fine-tune a 7B parameter model. Quick math: 7 billion × 4 bytes = 28GB. It doesn't fit. But in half-precision: 7 billion × 2 bytes = 14GB. It fits! Except… during training, the optimizer stores two additional copies of every parameter (momentum and variance in Adam), gradients consume another copy's worth of memory, and activations scale with your batch size. Suddenly you need 80GB+. Where did all that memory go?
This is the question every ML practitioner asks before every experiment, and getting the answer wrong means wasted hours staring at CUDA out of memory errors. In this post, we'll profile GPU memory usage across model sizes, precision levels, batch sizes, and training stages. By the end, you'll have concrete numbers and practical formulas to predict memory requirements before hitting OOM.
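The back-of-envelope arithmetic above can be written as a runnable sketch (plain Python, no GPU required). The mixed-precision accounting — FP16 weights and gradients, FP32 Adam buffers, FP32 master copy — is spelled out in detail later in this post:

```python
def bytes_to_gb(n):
    return n / 1e9

params = 7e9  # 7B parameters

fp32_weights = bytes_to_gb(params * 4)  # 28.0 GB — doesn't fit in 24 GB
fp16_weights = bytes_to_gb(params * 2)  # 14.0 GB — fits... until training

# Mixed-precision training with Adam:
fp16_grads  = bytes_to_gb(params * 2)      # gradients, same shape as weights
adam_states = bytes_to_gb(params * 4 * 2)  # m and v, both kept in FP32
fp32_master = bytes_to_gb(params * 4)      # FP32 master copy of the weights

training_total = fp16_weights + fp16_grads + adam_states + fp32_master
print(f"{training_total:.0f} GB before counting a single activation")  # 112 GB
```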
GPU memory breaks down into four main consumers:
- Model parameters — the weights themselves. A 7B model in FP16 takes 14GB. Straightforward.
- Optimizer states — Adam stores two buffers per parameter (first and second moment estimates), both in FP32 even when the model is FP16. That's 2 × 7B × 4 bytes = 56GB of optimizer state for our 7B model. With mixed-precision training, you also keep an FP32 master copy of the weights, adding another 28GB.
- Gradients — same shape as parameters. In FP16 training: 14GB.
- Activations — intermediate tensors saved during the forward pass so the backward pass can compute gradients. This is the sneaky one: it scales with batch_size × sequence_length × hidden_dimension, and for large batches it can dwarf everything else.
The master formula:
Total VRAM = Parameters + Optimizer States + Gradients + Activations + KV Cache (inference)
Let's put real numbers on each of these. (If you want to understand why we use different precisions, see Quantization from Scratch. For parameter-efficient alternatives to full fine-tuning, see LoRA from Scratch.)
Benchmarking Setup — Measuring What You Can't See
Before we can profile anything, we need a reliable way to measure GPU memory at each stage of a model's lifecycle. PyTorch exposes several CUDA memory APIs, but they measure different things. Here's a profiling harness that captures the full picture:
```python
import torch
import contextlib


@contextlib.contextmanager
def gpu_memory_tracker(stage_name, log):
    """Track GPU memory before, during, and after a stage."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    mem_before = torch.cuda.memory_allocated()
    yield  # Run the profiled code
    torch.cuda.synchronize()
    mem_after = torch.cuda.memory_allocated()
    mem_peak = torch.cuda.max_memory_allocated()
    log.append({
        "stage": stage_name,
        "before_mb": mem_before / 1e6,
        "after_mb": mem_after / 1e6,
        "peak_mb": mem_peak / 1e6,
        "delta_mb": (mem_after - mem_before) / 1e6,
    })


def profile_model(model_cls, input_fn, optimizer_cls=None):
    """Profile memory across model lifecycle stages."""
    log = []
    with gpu_memory_tracker("Model Loading", log):
        model = model_cls().cuda()
    x = input_fn()  # Create input on GPU
    with gpu_memory_tracker("Forward Pass", log):
        output = model(x)
        loss = output.sum()
    if optimizer_cls:
        optimizer = optimizer_cls(model.parameters())
        with gpu_memory_tracker("Backward Pass", log):
            loss.backward()
        with gpu_memory_tracker("Optimizer Step", log):
            optimizer.step()
            optimizer.zero_grad()
    # Print breakdown table
    print(f"{'Stage':<20} {'Before':>10} {'After':>10} {'Peak':>10} {'Delta':>10}")
    print("-" * 62)
    for entry in log:
        print(f"{entry['stage']:<20} {entry['before_mb']:>9.1f}M "
              f"{entry['after_mb']:>9.1f}M {entry['peak_mb']:>9.1f}M "
              f"{entry['delta_mb']:>+9.1f}M")
    return log
```
A critical pitfall: torch.cuda.memory_allocated() only tracks memory managed by PyTorch's caching allocator. CUDA kernels, cuDNN workspace buffers, and memory fragmentation can add 10–20% overhead. Always cross-check with nvidia-smi for ground truth. The gap between memory_allocated and memory_reserved shows you how much the caching allocator is holding in reserve — memory that's allocated from CUDA but not currently used by any tensor.
We use torch.cuda.synchronize() before every measurement because CUDA operations are asynchronous. Without it, you'd be reading stale numbers from before the GPU finished its work.
Inference Memory — Model Size × Precision
Inference is the simpler case. No gradients, no optimizer states — just the model parameters and any runtime buffers. But the relationship between model size and memory isn't always linear, because precision changes the bytes-per-parameter and the KV cache adds a batch-dependent component that can surprise you.
```python
import torch
from transformers import AutoModelForCausalLM


def benchmark_inference_memory(model_name, precisions, batch_sizes, seq_len=512):
    """Measure inference memory across precisions and batch sizes."""
    results = []
    for precision in precisions:
        dtype_map = {
            "FP32": torch.float32,
            "FP16": torch.float16,
            "BF16": torch.bfloat16,
        }
        load_kwargs = {}
        if precision in dtype_map:
            load_kwargs["torch_dtype"] = dtype_map[precision]
        elif precision == "INT8":
            load_kwargs["load_in_8bit"] = True
        elif precision == "INT4":
            load_kwargs["load_in_4bit"] = True

        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto", **load_kwargs
        )
        model_mem = torch.cuda.memory_allocated() / 1e9

        for bs in batch_sizes:
            torch.cuda.reset_peak_memory_stats()
            input_ids = torch.randint(0, 1000, (bs, seq_len), device="cuda")
            with torch.no_grad():
                _ = model.generate(input_ids, max_new_tokens=1)
            peak_mem = torch.cuda.max_memory_allocated() / 1e9
            results.append({
                "precision": precision,
                "batch_size": bs,
                "model_gb": round(model_mem, 2),
                "peak_gb": round(peak_mem, 2),
                "kv_cache_gb": round(peak_mem - model_mem, 2),
            })
        del model
    return results
```
Here's what the numbers look like for a 7B-parameter model:
| Precision | Model Memory | Peak (bs=1, 512 tokens) | Peak (bs=8, 2048 tokens) |
|---|---|---|---|
| FP32 | 28.0 GB | 29.1 GB | 44.5 GB |
| FP16 | 14.0 GB | 14.5 GB | 22.2 GB |
| BF16 | 14.0 GB | 14.5 GB | 22.2 GB |
| INT8 | 7.0 GB | 7.5 GB | 15.2 GB |
| INT4 | 3.5 GB | 4.0 GB | 11.7 GB |
Notice that at batch_size=1 with short sequences, precision dominates. FP16 halves memory with negligible quality loss; INT8 and INT4 go further with increasingly noticeable quality trade-offs. But at batch_size=8 with 2048 tokens, the KV cache eats 8+ GB regardless of model precision — weight-only quantization doesn't shrink it, because the cache stores keys and values for every layer and every attention head at activation precision (typically FP16). (For a deep dive on this, see KV Cache from Scratch.)
The KV cache formula for a standard transformer:
KV cache = 2 × num_layers × num_heads × head_dim × bytes_per_param × batch_size × seq_length
For our 7B model (32 layers, 32 heads, 128-dim heads, FP16): each token costs 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 0.5 MB. At batch=8 and seq=2048, that's 8 × 2048 × 0.5 MB ≈ 8.2 GB — more than the INT4 model itself.
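The formula translates directly into a few lines of Python. Note that exact arithmetic gives ~8.6 GB at batch=8, seq=2048; the ~8.2 GB figure above comes from rounding each token to 0.5 MB:

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, bytes_per_param,
                   batch_size, seq_length):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_param
    return per_token * batch_size * seq_length

# 7B-class model: 32 layers, 32 heads, head_dim=128, FP16 (2 bytes/param)
per_token = kv_cache_bytes(32, 32, 128, 2, batch_size=1, seq_length=1)
print(per_token)  # 524288 bytes ≈ 0.5 MB per token

total = kv_cache_bytes(32, 32, 128, 2, batch_size=8, seq_length=2048)
print(f"{total / 1e9:.1f} GB")  # 8.6 GB — more than the INT4 model itself
```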
Training Memory — The Real Budget
Training is where GPU memory gets serious. The forward pass stores activations for every layer (needed by the backward pass to compute gradients), the backward pass allocates gradient tensors, and the optimizer step maintains its own state buffers. Let's measure each stage in isolation:
```python
import torch


def profile_training_stages(model, input_ids, labels):
    """Break down memory consumption at each training stage."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    stages = {}

    # Stage 1: Model already loaded — measure baseline
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    stages["model_params"] = torch.cuda.memory_allocated() / 1e9

    # Stage 2: Forward pass — activations accumulate
    torch.cuda.reset_peak_memory_stats()
    outputs = model(input_ids, labels=labels)
    loss = outputs.loss
    torch.cuda.synchronize()
    stages["after_forward"] = torch.cuda.memory_allocated() / 1e9
    stages["forward_peak"] = torch.cuda.max_memory_allocated() / 1e9

    # Stage 3: Backward pass — gradients allocated
    torch.cuda.reset_peak_memory_stats()
    loss.backward()
    torch.cuda.synchronize()
    stages["after_backward"] = torch.cuda.memory_allocated() / 1e9
    stages["backward_peak"] = torch.cuda.max_memory_allocated() / 1e9

    # Stage 4: Optimizer step — optimizer states created on first call
    torch.cuda.reset_peak_memory_stats()
    optimizer.step()
    torch.cuda.synchronize()
    stages["after_optimizer"] = torch.cuda.memory_allocated() / 1e9
    stages["optimizer_peak"] = torch.cuda.max_memory_allocated() / 1e9

    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    stages["after_cleanup"] = torch.cuda.memory_allocated() / 1e9

    print("Training Memory Waterfall (7B model, FP16, batch=4, seq=512):")
    print(f"  Model parameters:     {stages['model_params']:.1f} GB")
    print(f"  After forward pass:   {stages['after_forward']:.1f} GB "
          f"(+{stages['after_forward'] - stages['model_params']:.1f} GB activations)")
    print(f"  After backward pass:  {stages['after_backward']:.1f} GB "
          f"(+{stages['after_backward'] - stages['after_forward']:.1f} GB gradients)")
    print(f"  After optimizer step: {stages['after_optimizer']:.1f} GB "
          f"(+{stages['after_optimizer'] - stages['after_backward']:.1f} GB opt states)")
    print(f"  After cleanup:        {stages['after_cleanup']:.1f} GB")
    print(f"  Peak during backward: {stages['backward_peak']:.1f} GB")
    return stages
```
For a 7B model in FP16 with Adam and batch_size=4, seq_len=512, the waterfall looks like this:
| Stage | Memory | Component |
|---|---|---|
| Model loaded | 14.0 GB | Parameters (FP16) |
| After forward | 22.4 GB | +8.4 GB activations |
| After backward | 28.0 GB | +14.0 GB gradients, −8.4 GB activations freed |
| After optimizer | 84.0 GB | +56.0 GB Adam states (m & v in FP32) |
| After zero_grad | 70.0 GB | Gradients freed |
The optimizer step is the gut punch. Adam stores two FP32 buffers per parameter — the first moment (m) and second moment (v). That's 2 × 7B × 4 bytes = 56 GB of optimizer state for our 7B model. In full mixed-precision setups (like Hugging Face Trainer), you'd also keep FP32 master weights — adding another 28 GB and pushing total optimizer overhead to 84 GB. After the first step, these buffers persist for the entire training run.
This is exactly why LoRA is transformative for memory: by freezing the base model and only training small adapter matrices, you reduce trainable parameters by 100–1000×. The optimizer only needs states for the adapters, dropping 56+ GB of optimizer overhead to well under 1 GB.
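The savings are easy to verify with Adam's bookkeeping rule — two FP32 buffers per *trainable* parameter. The LoRA configuration below is hypothetical (rank-16 adapters on the four attention projections of a 7B-class model), chosen only to show the order of magnitude:

```python
def adam_state_gb(trainable_params):
    """Adam keeps two FP32 buffers (m and v) per trainable parameter."""
    return trainable_params * 4 * 2 / 1e9

full_ft = 7e9  # every parameter trainable

# Hypothetical LoRA config: rank-16 adapters on q, k, v, o projections
# of a 32-layer model with hidden_dim=4096. Each adapter is two
# matrices: (4096 x 16) down-projection and (16 x 4096) up-projection.
lora = 32 * 4 * 2 * (4096 * 16)  # 16.8M trainable params

print(f"Full fine-tune: {adam_state_gb(full_ft):.1f} GB of Adam state")   # 56.0 GB
print(f"LoRA adapters:  {adam_state_gb(lora) * 1000:.0f} MB of Adam state")  # 134 MB
```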
Activation Checkpointing
Activation checkpointing (also called gradient checkpointing) is the classic time-memory trade-off: instead of storing all intermediate activations during the forward pass, you only save activations at selected "checkpoint" layers and recompute the others during the backward pass. This can reduce activation memory by 5–10× at the cost of roughly 30% slower training. For models where activations are the bottleneck, it's the difference between fitting and OOM. (See Flash Attention from Scratch for another approach to reducing memory in the attention layers.)
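In PyTorch the real mechanism is `torch.utils.checkpoint.checkpoint` (or `model.gradient_checkpointing_enable()` on Hugging Face models); the sketch below is only a simplified cost model of the trade-off, with hypothetical per-layer numbers:

```python
def activation_gb(num_layers, per_layer_gb, checkpoint_every=None):
    """Simplified activation-memory model. Without checkpointing, every
    layer's activations stay resident until backward. With a checkpoint
    every k layers, only the checkpoint boundaries stay resident, plus
    one k-layer segment being recomputed during backward."""
    if checkpoint_every is None:
        return num_layers * per_layer_gb
    num_checkpoints = num_layers // checkpoint_every
    return (num_checkpoints + checkpoint_every) * per_layer_gb

# 32 layers at ~0.25 GB of activations per layer (hypothetical)
print(activation_gb(32, 0.25))                      # 8.0 GB stored
print(activation_gb(32, 0.25, checkpoint_every=8))  # 3.0 GB stored
```

Choosing `checkpoint_every` near the square root of the layer count minimizes resident memory in this model — the classic √L result.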
Batch Size, Sequence Length, and the OOM Cliff
Here's something counterintuitive: parameter memory, optimizer states, and gradient memory are all fixed regardless of batch size. A 7B model with Adam uses the same fixed overhead for params + optimizer + gradients — over 80 GB — whether you run batch_size=1 or batch_size=128. The only thing that changes is activations, and they scale linearly with batch_size × sequence_length.
```python
def find_max_batch_size(model, seq_len, gpu_memory_gb, precision="fp16"):
    """Find the maximum batch size that fits in GPU memory."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}[precision]
    param_count = sum(p.numel() for p in model.parameters())

    # Fixed memory: params + gradients
    param_mem = param_count * bytes_per_param
    grad_mem = param_count * bytes_per_param
    # Optimizer states (Adam: FP32 master weights + 2 momentum buffers)
    opt_mem = param_count * 4 * 3  # 3 FP32 copies
    fixed_mem = param_mem + grad_mem + opt_mem
    available_for_activations = (gpu_memory_gb * 1e9) - fixed_mem

    # Activation memory per sample (rough estimate for transformers):
    #   ~12 * hidden_dim * seq_len * num_layers * bytes_per_param
    # The factor of 12 accounts for attention matrices, layer norms,
    # feed-forward intermediates, and saved inputs for backward
    hidden_dim = model.config.hidden_size
    num_layers = model.config.num_hidden_layers
    act_per_sample = 12 * hidden_dim * seq_len * num_layers * bytes_per_param

    max_batch = int(available_for_activations / act_per_sample)
    max_batch = max(max_batch, 0)

    print(f"Fixed memory: {fixed_mem / 1e9:.1f} GB "
          f"(params={param_mem/1e9:.1f}, grads={grad_mem/1e9:.1f}, "
          f"optimizer={opt_mem/1e9:.1f})")
    print(f"Activation/sample: {act_per_sample / 1e6:.0f} MB")
    print(f"Available for activations: {available_for_activations / 1e9:.1f} GB")
    print(f"Max batch size: {max_batch}")
    return max_batch
```
This creates what I call the OOM cliff. You can run batch_size=4 with headroom to spare, batch_size=8 is tight but works, and batch_size=12 instantly crashes with CUDA out of memory. There's no gradual degradation — it's a hard wall.
The memory curve looks like this: a flat base (parameters + optimizer + gradients) plus a linear slope (activation memory per sample). Where that line crosses your GPU's VRAM ceiling is your maximum batch size.
Gradient accumulation is the escape hatch. Instead of processing your entire batch in one forward pass, you process micro-batches and accumulate their gradients before the optimizer step:
effective_batch = micro_batch_size × accumulation_steps
Memory scales with micro_batch_size, not effective_batch_size. Want an effective batch of 64 on a GPU that only fits batch=4? Set accumulation_steps=16. You get identical gradient updates with 4× the total compute time — but it fits.
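The "identical gradient updates" claim is worth seeing concretely. A toy example with a mean-squared loss on `w*x` (gradient of the per-batch mean loss w.r.t. `w` is `mean(2*x*(w*x - y))`) shows that averaging micro-batch gradients over accumulation steps reproduces the full-batch gradient exactly:

```python
def grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

full = grad(w, xs, ys)  # one big batch of 8

# Same data as 4 micro-batches of 2; divide each micro-gradient by the
# number of accumulation steps before summing (what loss / accum_steps
# does in a real training loop)
accum = 0.0
for i in range(0, 8, 2):
    accum += grad(w, xs[i:i + 2], ys[i:i + 2]) / 4

print(abs(full - accum) < 1e-9)  # True — same update, 1/4 the activation memory
```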
| GPU VRAM | 7B (FP16 + Adam) | 13B (FP16 + Adam) | 7B (INT8 + LoRA) |
|---|---|---|---|
| 24 GB | OOM | OOM | bs=8 |
| 40 GB | OOM | OOM | bs=32 |
| 48 GB | OOM | OOM | bs=48 |
| 80 GB | OOM | OOM | bs=96+ |
This table makes clear why LoRA + quantization has become the default for practitioners. Full fine-tuning of even a 7B model with standard Adam won't fit on a single GPU — you need multi-GPU sharding (FSDP/ZeRO) or 8-bit optimizers. Meanwhile, INT8 + LoRA fits on a 24 GB consumer card with room for decent batch sizes.
Multi-GPU Strategies — When One Card Isn't Enough
When your model simply won't fit on one GPU, there are three fundamental strategies for distributing memory across multiple cards. Each trades off memory savings against communication overhead differently:
```python
def estimate_multi_gpu_memory(param_count_b, num_gpus, strategy, precision="fp16"):
    """Estimate per-GPU memory for different parallelism strategies."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}[precision]
    param_mem = param_count_b * 1e9 * bytes_per_param
    grad_mem = param_mem  # Same as params
    opt_mem = param_count_b * 1e9 * 4 * 3  # Adam: FP32 master + m + v

    if strategy == "data_parallel":
        # Each GPU holds a FULL copy of everything; batches are split
        per_gpu_mem = param_mem + grad_mem + opt_mem
        comm = "AllReduce gradients every step"
        speedup = f"~{num_gpus}x throughput (linear scaling)"
    elif strategy == "pipeline_parallel":
        # Model layers split across GPUs
        per_gpu_mem = (param_mem + grad_mem + opt_mem) / num_gpus
        comm = "Activations sent between pipeline stages"
        speedup = "~{0}x capacity, <{0}x throughput (pipeline bubbles)".format(num_gpus)
    elif strategy == "fsdp_zero3":
        # Params, gradients, AND optimizer states sharded across GPUs
        per_gpu_mem = (param_mem + grad_mem + opt_mem) / num_gpus
        comm = "AllGather params before forward, ReduceScatter gradients"
        speedup = f"~{num_gpus}x capacity, good throughput with overlap"
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    per_gpu_gb = per_gpu_mem / 1e9
    print(f"Strategy: {strategy}")
    print(f"  Per-GPU memory: {per_gpu_gb:.1f} GB (model/grads/optimizer)")
    print(f"  Communication:  {comm}")
    print(f"  Scaling:        {speedup}")
    return per_gpu_gb
```
Here's the concrete comparison for a 13B model (26 GB of FP16 weights; with gradients and Adam's three FP32 buffers, 208 GB of total training state):

| Strategy | Per-GPU (2 GPUs) | Per-GPU (4 GPUs) | Communication Cost |
|---|---|---|---|
| Data Parallel | 208 GB (full copy each) | 208 GB | Gradients synced each step |
| Pipeline Parallel | 104 GB | 52 GB | Activations between stages |
| FSDP / ZeRO-3 | 104 GB | 52 GB | Params gathered, grads scattered |
Data Parallel doesn't save memory at all — it's purely for throughput. Each GPU holds a complete copy. Pipeline Parallel and FSDP both shard the memory, but FSDP is generally preferred because it avoids the "pipeline bubble" problem (GPUs sitting idle while waiting for activations from the previous stage).
The practical decision tree:
- Model fits on one GPU? Use single-GPU with quantization (INT8/INT4).
- Training fits with LoRA? Use single-GPU with LoRA.
- Need more throughput? Data Parallel (simple, linear speedup).
- Model doesn't fit at all? FSDP / ZeRO-3 (shards everything).
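The decision tree collapses naturally into code. The boolean flags here are hypothetical simplifications — in practice each is answered by the memory estimates from the earlier sections:

```python
def pick_strategy(fits_one_gpu, training, fits_with_lora, need_more_throughput):
    """Sketch of the decision tree above, simplified to boolean inputs."""
    if not fits_one_gpu:
        return "FSDP / ZeRO-3"         # shards params, grads, and opt states
    if training and fits_with_lora:
        return "single GPU + LoRA"
    if need_more_throughput:
        return "Data Parallel"         # full copy per GPU, linear speedup
    return "single GPU + quantization (INT8/INT4)"

print(pick_strategy(True, True, True, False))    # single GPU + LoRA
print(pick_strategy(False, True, False, False))  # FSDP / ZeRO-3
```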
The Decision Calculator — Will It Fit?
Let's wrap everything into a single function that estimates memory requirements for any configuration. This consolidates the formulas from every section above:
```python
def will_it_fit(
    param_billions,
    precision="fp16",
    training=True,
    optimizer="adam",
    batch_size=1,
    seq_length=2048,
    hidden_dim=4096,
    num_layers=32,
    gpu_vram_gb=24,
):
    """Estimate total GPU memory and whether it fits."""
    bpp = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}[precision]
    params = param_billions * 1e9

    # Model parameters
    param_mem = params * bpp

    # KV cache (inference only — during training, this is part of activations)
    num_heads = hidden_dim // 128  # Standard head_dim = 128
    kv_per_token = 2 * num_layers * num_heads * 128 * bpp
    kv_cache = kv_per_token * batch_size * seq_length if not training else 0

    # Optimizer states
    if training and optimizer == "adam":
        opt_mem = params * 4 * 3  # FP32 master + momentum + variance
    elif training and optimizer == "sgd":
        opt_mem = params * 4  # FP32 master weights only
    else:
        opt_mem = 0

    # Gradients
    grad_mem = params * bpp if training else 0

    # Activations (training only — rough estimate for transformers)
    if training:
        act_mem = 12 * hidden_dim * seq_length * num_layers * bpp * batch_size
    else:
        act_mem = 0

    total = param_mem + opt_mem + grad_mem + act_mem + kv_cache
    total_gb = total / 1e9
    fits = total_gb <= gpu_vram_gb

    components = {
        "Parameters": param_mem / 1e9,
        "Optimizer States": opt_mem / 1e9,
        "Gradients": grad_mem / 1e9,
        "Activations": act_mem / 1e9,
        "KV Cache": kv_cache / 1e9,
    }

    print(f"\n{'='*50}")
    print(f"  {param_billions}B model | {precision.upper()} | "
          f"{'Training' if training else 'Inference'}")
    print(f"  Batch={batch_size}, Seq={seq_length}")
    print(f"{'='*50}")
    for name, gb in components.items():
        if gb > 0:
            bar = "#" * int(gb * 2)
            print(f"  {name:<18} {gb:>7.1f} GB {bar}")
    print(f"  {'─'*40}")
    print(f"  {'TOTAL':<18} {total_gb:>7.1f} GB")
    print(f"  GPU VRAM: {gpu_vram_gb:>7.1f} GB")
    print(f"  Status: {'✓ FITS' if fits else '✗ OOM'}")
    return total_gb, fits
```
Some quick rules of thumb that hold within ~20% for standard transformer architectures:
- Inference: estimate ~2× raw parameter memory (model + KV cache headroom). A 7B FP16 model needs ~28 GB for comfortable serving with reasonable batch sizes.
- Training with Adam (FP16): estimate ~10× the FP16 parameter memory. A 7B model at 14 GB FP16 needs ~140 GB total. The breakdown: 1× params + 6× optimizer (3 FP32 buffers at 2× size each) + 1× gradients + 2× activations.
- Training with LoRA + INT8: estimate model_memory + ~1 GB per billion parameters for adapter overhead. Dramatically more accessible.
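These rules of thumb fit in a dozen lines. The mode names below are my own labels, not standard terminology:

```python
def rule_of_thumb_gb(param_billions, mode):
    """Back-of-envelope VRAM from the rules above (within ~20% for
    standard transformer architectures)."""
    fp16_gb = param_billions * 2  # FP16: 2 bytes/param -> 2 GB per billion
    if mode == "inference_fp16":
        return 2 * fp16_gb        # model + KV-cache headroom
    if mode == "train_adam_fp16":
        return 10 * fp16_gb       # params + optimizer + grads + activations
    if mode == "train_lora_int8":
        # INT8 weights (1 GB per billion) + ~1 GB per billion adapter overhead
        return param_billions + param_billions
    raise ValueError(f"unknown mode: {mode}")

print(rule_of_thumb_gb(7, "inference_fp16"))   # 28 GB
print(rule_of_thumb_gb(7, "train_adam_fp16"))  # 140 GB
print(rule_of_thumb_gb(7, "train_lora_int8"))  # 14 GB
```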
Try It: GPU Memory Calculator
Adjust the sliders and options to see a real-time memory breakdown. The stacked bar chart shows exactly where your VRAM goes.
Try It: Batch Size Explorer
Watch what happens to memory as batch size increases. Click "Animate" to see the OOM cliff in action.
References & Further Reading
- Rajbhandari et al. — ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2020) — the foundational paper on sharding optimizer states, gradients, and parameters across GPUs
- Korthikanti et al. — Reducing Activation Recomputation in Large Transformer Models (2022) — selective activation checkpointing strategies for large-scale training
- Dettmers et al. — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022) — the INT8 quantization method that preserves model quality
- PyTorch CUDA Memory Management Documentation — official reference for memory_allocated, max_memory_allocated, and the caching allocator
- Eleuther AI — Transformer Math 101 — excellent practical guide to estimating compute and memory requirements
Related DadOps posts: Quantization from Scratch, LoRA from Scratch, Serving LLMs at Scale, KV Cache from Scratch, Flash Attention from Scratch, Profiling Python AI Code, Python Concurrency for AI