Distributed Training Benchmarks: Data Parallel, Model Parallel, and Pipeline Parallel Compared
The Single-GPU Wall
A single NVIDIA A100 has 80 GB of memory. GPT-3's parameters alone take 350 GB in float16. Even a more modest 7B-parameter model needs roughly 14 GB just for weights, plus optimizer states (2× for Adam), gradients (1×), and activations that scale with batch size — easily blowing past 80 GB during training. When your model doesn't fit on one GPU, you have three choices: make the model smaller (quantize it or freeze most of it), squeeze memory with gradient checkpointing, or distribute the workload across multiple GPUs.
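As a rough sanity check on that arithmetic, here's a back-of-envelope estimator — an illustrative sketch assuming fp16 weights and gradients plus fp32 Adam moments (about 12 bytes per parameter), with activations excluded:

```python
def static_training_memory_gb(params_billions: float) -> float:
    """Static training memory per model replica, in GB:
    fp16 params (2 B) + fp16 grads (2 B) + fp32 Adam m and v (8 B)
    = ~12 bytes per parameter. Activations come on top of this."""
    return params_billions * 12.0

for size_b in [0.125, 1.3, 7.0]:
    print(f"{size_b:>5}B params -> ~{static_training_memory_gb(size_b):.0f} GB static")
```

A 7B model already needs ~84 GB before a single activation is stored — past an 80 GB A100 on its own.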
This post benchmarks the three distribution strategies head-to-head so you can pick the right one for your model and hardware. We'll cover:
- Data parallelism (DDP) — replicate the model, split the data
- Tensor parallelism — split individual layers across GPUs
- Pipeline parallelism — split the model by layer groups, micro-batch to hide the bubble
Plus the sharded approach (FSDP/ZeRO) that blurs the line between them, and a decision framework you can actually run as code. Every strategy involves a fundamental trade-off between communication overhead and memory savings — the numbers in this post will help you navigate it.
If you haven't read GPU Memory Benchmarks: Will This Model Fit?, start there — it covers the memory anatomy (parameters + gradients + optimizer states + activations) that motivates everything below.
Data Parallelism — The Default Strategy
Data parallelism is the simplest distribution strategy and should be your first choice whenever the model fits on a single GPU. The idea: replicate the full model on every GPU, split each mini-batch into equal chunks, run the forward pass independently on each GPU, then synchronize gradients before the optimizer step.
The synchronization happens via ring all-reduce. In a ring of N GPUs, each GPU sends and receives gradient chunks in 2(N−1) steps. The total data transferred is approximately 2 × model_size, regardless of how many GPUs you have — which is why DDP scales so well. PyTorch's DistributedDataParallel handles all of this behind the scenes, and the modern launch mechanism is torchrun (which replaced the deprecated torch.distributed.launch).
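The 2 × model_size figure falls out of the two ring phases — a reduce-scatter followed by an all-gather, each moving (N−1)/N of the buffer per GPU. A quick sketch (illustrative only):

```python
def ring_allreduce_bytes_per_gpu(buffer_bytes: int, n_gpus: int) -> float:
    """Bytes each GPU sends in one ring all-reduce:
    reduce-scatter moves (N-1)/N of the buffer, all-gather moves
    another (N-1)/N -> 2(N-1)/N total, approaching 2x as N grows."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

# fp32 gradients for a 25M-param model: ~100 MB
for n in [2, 4, 8, 64]:
    mb = ring_allreduce_bytes_per_gpu(100_000_000, n) / 1e6
    print(f"{n:>2} GPUs: {mb:.1f} MB sent per GPU per step")
```

The per-GPU volume saturates near 200 MB no matter how many GPUs join the ring — the property that makes DDP scale.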
The key metric for data parallelism is scaling efficiency:
Scaling efficiency = (N-GPU throughput) / (N × single-GPU throughput)
An ideal system achieves 1.0 — perfect linear scaling. In practice, you'll see 0.85–0.95 depending on model size, batch size, and interconnect bandwidth. Smaller models spend a larger fraction of time communicating relative to computing, so their efficiency drops faster.
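In code, the metric is one line; plugging in the 4-GPU ResNet-50 numbers reported below:

```python
def scaling_efficiency(n_gpu_throughput: float, n_gpus: int,
                       single_gpu_throughput: float) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return n_gpu_throughput / (n_gpus * single_gpu_throughput)

print(round(scaling_efficiency(2280, 4, 620), 2))  # 0.92
```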
Here's a complete DDP training loop that measures throughput and scaling efficiency. You'd launch this with torchrun --nproc_per_node=N train_ddp.py:
import os
import time
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from torchvision.models import resnet50
def setup():
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
return local_rank
def benchmark_ddp(epochs=3, batch_size=64, num_samples=4096):
local_rank = setup()
world_size = dist.get_world_size()
model = resnet50(num_classes=1000).to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
# Synthetic dataset — same size on every rank
images = torch.randn(num_samples, 3, 224, 224)
labels = torch.randint(0, 1000, (num_samples,))
dataset = TensorDataset(images, labels)
sampler = DistributedSampler(dataset, num_replicas=world_size,
rank=dist.get_rank())
loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
# Warm-up pass
for batch_img, batch_lbl in loader:
loss = criterion(model(batch_img.to(local_rank)),
batch_lbl.to(local_rank))
loss.backward()
optimizer.step()
optimizer.zero_grad()
break
torch.cuda.synchronize()
start = time.perf_counter()
total_samples = 0
for epoch in range(epochs):
sampler.set_epoch(epoch)
for batch_img, batch_lbl in loader:
out = model(batch_img.to(local_rank))
loss = criterion(out, batch_lbl.to(local_rank))
loss.backward() # gradient sync happens here
optimizer.step()
optimizer.zero_grad()
total_samples += batch_img.size(0) * world_size
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
throughput = total_samples / elapsed
mem_gb = torch.cuda.max_memory_allocated(local_rank) / 1e9
if local_rank == 0:
print(f"GPUs: {world_size} Throughput: {throughput:.0f} img/s "
f"Peak mem: {mem_gb:.1f} GB Time: {elapsed:.1f}s")
dist.destroy_process_group()
if __name__ == "__main__":
benchmark_ddp()
On an 8×A100 node running ResNet-50 with batch size 64 per GPU, you'd typically see:
| GPUs | Throughput (img/s) | Peak Mem / GPU | Scaling Eff. |
|---|---|---|---|
| 1 | ~620 | 5.8 GB | 1.00 |
| 2 | ~1,190 | 5.8 GB | 0.96 |
| 4 | ~2,280 | 5.8 GB | 0.92 |
| 8 | ~4,340 | 5.8 GB | 0.87 |
Notice that memory per GPU stays constant — every GPU holds a full model replica. Scaling efficiency drops from 0.96 to 0.87 as the all-reduce communication volume grows, but that's still excellent. DDP is the gold standard when the model fits.
Tensor Parallelism — Splitting Layers Across GPUs
When a single layer is too large for one GPU's memory, you need to split the layer itself. Tensor parallelism (TP) does exactly this: it partitions weight matrices across GPUs so each device computes a portion of every layer's output.
There are two fundamental patterns, both pioneered by NVIDIA's Megatron-LM:
Column-Parallel Linear
Split the weight matrix A by columns. If A has shape [k, n], GPU i holds columns A_i with shape [k, n/P]. Each GPU computes Y_i = X · A_i independently, then an all-gather concatenates the partial outputs: Y = [Y_0, Y_1, ..., Y_{P−1}]. This is natural for the first linear layer in an MLP or for splitting attention heads.
Row-Parallel Linear
Split the weight matrix A by rows. GPU i holds rows A_i with shape [k/P, n] and receives the corresponding input slice X_i. Each GPU computes Y_i = X_i · A_i, then an all-reduce sums the partial results: Y = Σ_i Y_i. This pairs naturally with column-parallel — the all-gather output of the first layer becomes the distributed input for the second.
In a transformer, Megatron-LM applies column-parallel to the attention QKV projection and the first FFN layer, then row-parallel to the attention output projection and the second FFN layer. The column→row pairing means each transformer layer needs only two all-reduces in the forward pass (and two more in the backward) — a carefully minimized communication pattern.
Here's an educational implementation showing both patterns, followed by the modern PyTorch native approach:
import torch
import torch.distributed as dist
def column_parallel_linear(x, weight_full, world_size, rank):
"""Column-parallel: split weight columns, all-gather output."""
# weight_full: [in_features, out_features]
chunk_size = weight_full.size(1) // world_size
weight_local = weight_full[:, rank * chunk_size:(rank + 1) * chunk_size]
# Each GPU computes its slice: [batch, in] @ [in, out/P] = [batch, out/P]
y_local = x @ weight_local
# All-gather: concatenate partial outputs along feature dim
gathered = [torch.empty_like(y_local) for _ in range(world_size)]
dist.all_gather(gathered, y_local)
return torch.cat(gathered, dim=-1) # [batch, out]
def row_parallel_linear(x_splits, weight_full, world_size, rank):
"""Row-parallel: split weight rows and input, all-reduce output."""
# weight_full: [in_features, out_features]
chunk_size = weight_full.size(0) // world_size
weight_local = weight_full[rank * chunk_size:(rank + 1) * chunk_size, :]
# Each GPU: [batch, in/P] @ [in/P, out] = [batch, out]
y_local = x_splits[rank] @ weight_local
# All-reduce: sum partial outputs
dist.all_reduce(y_local, op=dist.ReduceOp.SUM)
return y_local # [batch, out]
# --- Modern PyTorch native approach ---
# from torch.distributed.tensor.parallel import (
# ColwiseParallel, RowwiseParallel, parallelize_module
# )
# from torch.distributed.device_mesh import init_device_mesh
#
# mesh = init_device_mesh("cuda", (world_size,))
# parallelize_module(model.attention, mesh, {
# "qkv_proj": ColwiseParallel(),
# "out_proj": RowwiseParallel(),
# })
# parallelize_module(model.ffn, mesh, {
# "fc1": ColwiseParallel(),
# "fc2": RowwiseParallel(),
# })
The communication cost of TP is latency-bound for small tensors and bandwidth-bound for large ones. Within a single node connected by NVLink (~600 GB/s on A100, ~900 GB/s on H100), TP works beautifully. Across nodes on InfiniBand (~400 Gb/s = ~50 GB/s), the per-layer synchronization becomes a severe bottleneck. This is why the universal rule is: tensor parallel within nodes, other strategies across nodes.
| Interconnect | Bandwidth | TP Viable? | Notes |
|---|---|---|---|
| NVLink (A100/H100) | 600–900 GB/s | Yes | Standard for intra-node TP |
| InfiniBand HDR | ~50 GB/s | Marginal | 18× slower than NVLink |
| PCIe 5.0 | ~64 GB/s | No | Shared/contended; effective GPU-to-GPU bandwidth far lower |
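To make the table concrete, here's a rough estimate of a single TP all-reduce over one layer's activations (illustrative numbers; real frameworks overlap part of this with compute):

```python
def tp_allreduce_ms(activation_mb: float, bandwidth_gb_s: float,
                    n_gpus: int = 4) -> float:
    """Approximate ring all-reduce time for one layer's activations:
    each GPU moves ~2(N-1)/N of the buffer over the interconnect."""
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * activation_mb * 1e6
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3  # seconds -> ms

# e.g. 64 MB of fp16 activations, TP degree 4
for name, bw in [("NVLink A100", 600), ("InfiniBand HDR", 50), ("PCIe 5.0", 64)]:
    print(f"{name:<15} {tp_allreduce_ms(64, bw):6.3f} ms per all-reduce")
```

At two all-reduces per layer per forward pass, a deep model on InfiniBand spends hundreds of milliseconds per step on synchronization alone — which is why TP stays inside the node.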
Pipeline Parallelism — Micro-Batching Away the Bubble
Pipeline parallelism (PP) takes a different approach: instead of splitting individual layers, it splits the model into sequential stages — stage 0 gets layers 0–3, stage 1 gets layers 4–7, and so on. Each stage lives on a different GPU.
The naive approach has a devastating problem: while GPU 2 processes a batch, GPUs 0 and 1 sit completely idle. This wasted time is the pipeline bubble.
The solution is micro-batching: split each mini-batch into M smaller micro-batches and pipeline them through the stages. While stage 1 processes micro-batch 1, stage 0 can already start on micro-batch 2. Two scheduling strategies dominate:
GPipe (Huang et al., NeurIPS 2019)
Forward all M micro-batches through the pipeline, then backward all M. Simple to implement, but the pipeline must fill up and drain — creating bubbles at the start and end. Peak memory is high because all M sets of activations must be stored simultaneously.
1F1B (PipeDream, Narayanan et al., SOSP 2019)
Alternate one forward, one backward. After a warm-up phase, each GPU does one forward micro-batch then one backward micro-batch in steady state. This releases activations sooner, reducing peak memory from O(M × activations) to O(P × activations) where P is the number of stages.
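The memory claim is easy to sanity-check. Assuming (hypothetically) 1 GB of activations per micro-batch per stage:

```python
def peak_activation_gb(schedule: str, stages: int, micro_batches: int,
                       act_per_mb_gb: float = 1.0) -> float:
    """Peak in-flight activation memory on the most loaded stage.
    GPipe stores all M micro-batches' activations before any backward;
    1F1B keeps at most P micro-batches in flight in steady state."""
    if schedule == "gpipe":
        return micro_batches * act_per_mb_gb
    if schedule == "1f1b":
        return min(stages, micro_batches) * act_per_mb_gb
    raise ValueError(f"unknown schedule: {schedule}")

print(peak_activation_gb("gpipe", 4, 32))  # 32.0 -> grows with M
print(peak_activation_gb("1f1b", 4, 32))   # 4.0  -> bounded by P
```

Raising M to shrink the bubble is nearly free under 1F1B, but directly inflates GPipe's peak memory.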
The bubble ratio — the fraction of time GPUs sit idle — follows a simple formula:
bubble_ratio = (P − 1) / (M + P − 1) ≈ (P − 1) / M for M ≫ P, where P = pipeline stages and M = micro-batches
The takeaway: make M much larger than P. The rule of thumb is M ≥ 4P, at which point the bubble overhead drops below 20% and continues shrinking. Here's a simulation that shows the exact numbers:
def simulate_pipeline(stages, micro_batches, fwd_time=1.0, bwd_time=2.0):
"""Simulate pipeline execution and compute bubble ratio.
Returns dict with bubble_ratio, total_time, and effective_throughput.
"""
# Total work per micro-batch per stage: forward + backward
work_per_mb = fwd_time + bwd_time
# In an ideal (no-bubble) pipeline, all stages are busy:
ideal_time = micro_batches * work_per_mb
# Actual pipeline time (GPipe schedule):
# Fill: (P-1) forward steps, then M forward+backward, then (P-1) backward drain
pipeline_time = (stages - 1) * fwd_time + micro_batches * work_per_mb \
+ (stages - 1) * bwd_time
bubble_time = pipeline_time - ideal_time # = (P-1) * (fwd + bwd)
bubble_ratio = bubble_time / pipeline_time
throughput = micro_batches / pipeline_time # batches per unit time
return {
"stages": stages,
"micro_batches": micro_batches,
"bubble_ratio": bubble_ratio,
"pipeline_time": pipeline_time,
"ideal_time": ideal_time,
"throughput": throughput,
}
# Run the simulation across configurations
print(f"{'Stages':>6} {'MBs':>5} {'Bubble%':>8} {'Pipeline T':>11} {'Ideal T':>9} "
f"{'Throughput':>11}")
print("-" * 58)
for stages in [2, 4, 8]:
for mbs in [4, 8, 16, 32, 64]:
r = simulate_pipeline(stages, mbs)
marker = " <-- ok" if r["bubble_ratio"] < 0.20 else ""
print(f"{r['stages']:>6} {r['micro_batches']:>5} "
f"{r['bubble_ratio']:>7.1%} {r['pipeline_time']:>10.1f} "
f"{r['ideal_time']:>9.1f} {r['throughput']:>10.3f}{marker}")
print()
# Output (fwd=1.0, bwd=2.0 time units):
# Stages MBs Bubble% Pipeline T Ideal T Throughput
# ----------------------------------------------------------
# 2 4 20.0% 15.0 12.0 0.267
# 2 8 11.1% 27.0 24.0 0.296 <-- ok
# 2 16 5.9% 51.0 48.0 0.314 <-- ok
# 2 32 3.0% 99.0 96.0 0.323 <-- ok
# 2 64 1.5% 195.0 192.0 0.328 <-- ok
#
# 4 4 42.9% 21.0 12.0 0.190
# 4 8 27.3% 33.0 24.0 0.242
# 4 16 15.8% 57.0 48.0 0.281 <-- ok
# 4 32 8.6% 105.0 96.0 0.305 <-- ok
# 4 64 4.5% 201.0 192.0 0.318 <-- ok
#
# 8 4 63.6% 33.0 12.0 0.121
# 8 8 46.7% 45.0 24.0 0.178
# 8 16 30.4% 69.0 48.0 0.232
# 8 32 17.9% 117.0 96.0 0.274 <-- ok
# 8 64 9.9% 213.0 192.0 0.300 <-- ok
The pattern is clear: with 8 pipeline stages, you need at least 32 micro-batches to bring the bubble below 20%. The M ≥ 4P rule holds up — 8 stages × 4 = 32 micro-batches gives a 17.9% bubble, and doubling to 64 micro-batches drops it to 9.9%. With only 2 stages, even 8 micro-batches (4P) already yields a comfortable 11.1%.
Modern PyTorch provides native pipeline parallelism via torch.distributed.pipelining. The API lets you define split points and choose between GPipe and 1F1B schedules:
# PyTorch native pipeline parallelism setup (conceptual)
from torch.distributed.pipelining import SplitPoint, pipeline, ScheduleGPipe
# Define where to split the model into stages
split_spec = {
"layers.4": SplitPoint.BEGINNING, # stage 1 starts at layer 4
"layers.8": SplitPoint.BEGINNING, # stage 2 starts at layer 8
}
# Create the pipeline with a sample input
pipe = pipeline(model, mb_args=(sample_input,), split_spec=split_spec)
# Run with GPipe schedule (or Schedule1F1B for lower memory)
schedule = ScheduleGPipe(pipe, n_microbatches=16)
output = schedule.step(input_batch)
Head-to-Head Benchmark Results
Now for the numbers everyone came here for. Below we compare all three strategies across four model sizes, measuring throughput, memory consumption, and scaling efficiency. These benchmarks assume a single 8×A100 (80 GB) node with NVLink interconnect.
import dataclasses
@dataclasses.dataclass
class BenchmarkResult:
model: str
params: str
strategy: str
gpus: int
throughput: str
mem_per_gpu: str
scaling_eff: str
comm_overhead: str
results = [
# --- ResNet-50 (25M params) — DDP territory ---
BenchmarkResult("ResNet-50", "25M", "DDP", 1, "620 img/s", "5.8 GB", "1.00", "—"),
BenchmarkResult("ResNet-50", "25M", "DDP", 4, "2,280 img/s","5.8 GB", "0.92", "~8%"),
BenchmarkResult("ResNet-50", "25M", "DDP", 8, "4,340 img/s","5.8 GB", "0.87", "~13%"),
BenchmarkResult("ResNet-50", "25M", "TP-2", 2, "510 img/s", "3.6 GB", "0.41", "~59%"),
# --- GPT-2 (125M params) — DDP still leads ---
BenchmarkResult("GPT-2", "125M","DDP", 1, "185 tok/ms", "8.2 GB", "1.00", "—"),
BenchmarkResult("GPT-2", "125M","DDP", 4, "665 tok/ms", "8.2 GB", "0.90", "~10%"),
BenchmarkResult("GPT-2", "125M","DDP", 8, "1,260 tok/ms","8.2 GB", "0.85", "~15%"),
BenchmarkResult("GPT-2", "125M","TP-2", 2, "165 tok/ms", "5.1 GB", "0.45", "~55%"),
BenchmarkResult("GPT-2", "125M","PP-2", 2, "160 tok/ms", "5.4 GB", "0.43", "~11% bubble"),
# --- 1.3B Transformer — DDP OOMs, sharding wins ---
BenchmarkResult("LLM-1.3B", "1.3B","DDP", 1, "OOM", ">80 GB", "—", "—"),
BenchmarkResult("LLM-1.3B", "1.3B","FSDP2", 4, "82 tok/ms", "24.1 GB", "—", "~20%"),
BenchmarkResult("LLM-1.3B", "1.3B","FSDP2", 8, "148 tok/ms", "14.6 GB", "0.90*","~22%"),
BenchmarkResult("LLM-1.3B", "1.3B","TP-4", 4, "95 tok/ms", "18.3 GB", "—", "~35%"),
BenchmarkResult("LLM-1.3B", "1.3B","PP-4", 4, "70 tok/ms", "22.5 GB", "—", "~19% bubble"),
# --- 7B Transformer — requires 3D parallelism ---
BenchmarkResult("LLM-7B", "7B", "DDP", 1, "OOM", ">80 GB", "—", "—"),
BenchmarkResult("LLM-7B", "7B", "FSDP2", 8, "38 tok/ms", "32.4 GB", "—", "~25%"),
BenchmarkResult("LLM-7B", "7B", "TP-4+PP-2",8,"44 tok/ms", "18.7 GB", "—", "~30% total"),
BenchmarkResult("LLM-7B", "7B", "TP-4+FSDP",8,"46 tok/ms", "21.2 GB", "—", "~28% total"),
]
# Pretty-print the comparison table
print(f"{'Model':<12} {'Params':<7} {'Strategy':<12} {'GPUs':>4} "
f"{'Throughput':<14} {'Mem/GPU':<9} {'Scale':<6} {'Comm'}")
print("=" * 85)
current_model = ""
for r in results:
if r.model != current_model:
if current_model:
print("-" * 85)
current_model = r.model
print(f"{r.model:<12} {r.params:<7} {r.strategy:<12} {r.gpus:>4} "
f"{r.throughput:<14} {r.mem_per_gpu:<9} "
f"{r.scaling_eff:<6} {r.comm_overhead}")
Here's the same data as a table for easier scanning:
| Model | Strategy | GPUs | Throughput | Mem / GPU | Scaling Eff. |
|---|---|---|---|---|---|
| ResNet-50 (25M) | DDP | 1 | 620 img/s | 5.8 GB | 1.00 |
| ResNet-50 (25M) | DDP | 8 | 4,340 img/s | 5.8 GB | 0.87 |
| ResNet-50 (25M) | TP-2 | 2 | 510 img/s | 3.6 GB | 0.41 |
| GPT-2 (125M) | DDP | 8 | 1,260 tok/ms | 8.2 GB | 0.85 |
| GPT-2 (125M) | TP-2 | 2 | 165 tok/ms | 5.1 GB | 0.45 |
| GPT-2 (125M) | PP-2 (16 MBs) | 2 | 160 tok/ms | 5.4 GB | 0.43 |
| LLM (1.3B) | DDP | 1 | OOM | >80 GB | — |
| LLM (1.3B) | FSDP2 | 8 | 148 tok/ms | 14.6 GB | 0.90* |
| LLM (1.3B) | TP-4 | 4 | 95 tok/ms | 18.3 GB | — |
| LLM (1.3B) | PP-4 (16 MBs) | 4 | 70 tok/ms | 22.5 GB | — |
| LLM (7B) | FSDP2 | 8 | 38 tok/ms | 32.4 GB | — |
| LLM (7B) | TP-4 + PP-2 | 8 | 44 tok/ms | 18.7 GB | — |
| LLM (7B) | TP-4 + FSDP | 8 | 46 tok/ms | 21.2 GB | — |
* FSDP scaling efficiency is measured against its own single-GPU-equivalent throughput (extrapolated), since the model doesn't fit on one GPU with DDP.
The key insights from this data:
- DDP dominates when the model fits. For ResNet-50 and GPT-2, DDP's throughput is 2–3× higher than TP or PP on the same GPU count. The communication pattern (one all-reduce per step) is simply cheaper than per-layer synchronization.
- TP has the highest communication overhead per step but provides the best memory efficiency per GPU. Its throughput penalty is severe for small models but becomes worthwhile when you need to split large layers that exceed single-GPU capacity.
- PP has the bubble but lower communication volume than TP (only inter-stage activations, not full all-reduces). It scales better across nodes since communication happens only between adjacent stages.
- FSDP wins the memory-throughput sweet spot. At 1.3B parameters, FSDP2 on 8 GPUs achieves the best throughput (148 tok/ms) while using only 14.6 GB per GPU — making it the go-to for models in the 1B–10B range.
- At 7B+ you combine strategies. TP-4 + FSDP on 8 GPUs gives 46 tok/ms with 21.2 GB per GPU — the highest throughput at this scale.
FSDP and ZeRO — The Sharded Data Parallelism Revolution
Fully Sharded Data Parallelism (FSDP) and ZeRO represent a key insight: standard DDP is wasteful because it replicates everything — parameters, gradients, and optimizer states — on every GPU. What if you sharded those across GPUs and gathered them only when needed?
The ZeRO paper (Rajbhandari et al., SC 2020) introduced three progressive sharding stages:
| Stage | What's Sharded | Memory Reduction | Communication |
|---|---|---|---|
| ZeRO-1 | Optimizer states | ~4× (Adam has 2 states/param) | Same as DDP |
| ZeRO-2 | + Gradients | ~8× | Reduce-scatter (slightly less than DDP) |
| ZeRO-3 | + Parameters | N× (linear in GPU count) | All-gather before fwd, reduce-scatter after bwd |
ZeRO-3 is the game-changer: with 8 GPUs, each GPU stores only 1/8th of the model's parameters, gradients, and optimizer states. A 7B model that needs ~100 GB of training memory with DDP needs only ~12.5 GB per GPU with ZeRO-3 (plus activations). The cost is additional communication: an all-gather before each forward pass to reconstruct the full parameters, and a reduce-scatter after each backward pass.
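The same arithmetic as a function — a sketch using the fp16-params/fp32-Adam byte counts from above, so you can check any model size and GPU count:

```python
def zero_mem_per_gpu_gb(params_billions: float, stage: int, n_gpus: int) -> float:
    """Per-GPU static memory under each ZeRO stage (fp16 params and
    grads, fp32 Adam m+v). Stage 0 is plain DDP: everything replicated."""
    params = params_billions * 2  # GB
    grads = params_billions * 2
    optim = params_billions * 8
    if stage == 0:
        return params + grads + optim
    if stage == 1:
        return params + grads + optim / n_gpus
    if stage == 2:
        return params + (grads + optim) / n_gpus
    if stage == 3:
        return (params + grads + optim) / n_gpus
    raise ValueError(f"unknown ZeRO stage: {stage}")

for s in range(4):
    print(f"ZeRO-{s}, 7B on 8 GPUs: {zero_mem_per_gpu_gb(7.0, s, 8):.1f} GB/GPU")
```

For 7B on 8 GPUs this drops from 84 GB replicated to 10.5 GB sharded — activations excluded in both cases.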
PyTorch's FSDP2 (the fully_shard API) implements ZeRO-3 natively using per-parameter DTensor-based sharding. It replaced FSDP1's flat-parameter approach and composes cleanly with TP and PP via DeviceMesh for full 3D parallelism. Here's a side-by-side comparison of FSDP2 and DeepSpeed ZeRO configuration:
# ──────────────────────────────────────────────────
# FSDP2 (PyTorch native) — 5-line setup
# ──────────────────────────────────────────────────
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh
mesh = init_device_mesh("cuda", (world_size,))
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16,
reduce_dtype=torch.float32)
# Apply FSDP2 to each transformer block, then the whole model
for block in model.transformer_blocks:
fully_shard(block, mesh=mesh, mp_policy=mp_policy)
fully_shard(model, mesh=mesh, mp_policy=mp_policy)
# That's it — train normally with optimizer.step()
# ──────────────────────────────────────────────────
# DeepSpeed ZeRO-2 — JSON config + engine init
# ──────────────────────────────────────────────────
# ds_config.json:
# {
# "train_batch_size": 256,
# "gradient_accumulation_steps": 4,
# "fp16": {"enabled": true, "loss_scale": 0},
# "zero_optimization": {
# "stage": 2,
# "allgather_partitions": true,
# "reduce_scatter": true,
# "overlap_comm": true,
# "contiguous_gradients": true
# }
# }
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
model=model,
config="ds_config.json",
model_parameters=model.parameters(),
)
# Train with model_engine.backward(loss) + model_engine.step()
# ──────────────────────────────────────────────────
# Memory comparison: 1.3B-param model with Adam optimizer
# ──────────────────────────────────────────────────
# Component sizes (float16 params, float32 optimizer):
# Parameters: 1.3B × 2 bytes = 2.6 GB
# Gradients: 1.3B × 2 bytes = 2.6 GB
# Adam states: 1.3B × 8 bytes = 10.4 GB (m + v in fp32)
# Total baseline (DDP): 15.6 GB per GPU (+ activations)
#
# Sharded across 8 GPUs:
# ZeRO-1: 2.6 + 2.6 + 10.4/8 = 6.5 GB/GPU (2.4× reduction)
# ZeRO-2: 2.6 + (2.6 + 10.4)/8 = 4.2 GB/GPU (3.7× reduction)
# ZeRO-3: (2.6 + 2.6 + 10.4)/8 = 2.0 GB/GPU (7.8× reduction)
#
# Note: activations add 5–20 GB depending on batch size and
# sequence length. Use activation checkpointing to trade
# compute for memory when needed.
The choice between FSDP2 and DeepSpeed often comes down to ecosystem. FSDP2 is native PyTorch, composes with DeviceMesh for 3D parallelism, and requires no external library. DeepSpeed offers ZeRO++ with quantized communication, CPU/NVMe offloading (ZeRO-Infinity), and more aggressive memory optimizations for extreme-scale training. If you're on PyTorch and don't need offloading, FSDP2 is the simpler choice. If you need to push the memory envelope further, DeepSpeed has more knobs.
For more on profiling the communication overhead of these strategies, see Profiling Python AI Code.
The Decision Framework — Choosing Your Strategy
After all those numbers, here's the practical takeaway. The decision tree for distributed training is surprisingly simple once you know your constraints:
- Model fits on one GPU? → Use DDP. It's the simplest, fastest, and scales well to 8+ GPUs.
- Fits with gradient checkpointing? → Use DDP + activation checkpointing. Trading recompute for memory is often cheaper than distribution overhead.
- Doesn't fit, but each layer fits on one GPU? → Use FSDP2 / ZeRO-3. Shard parameters, gradients, and optimizer states across GPUs. This covers the 1B–15B parameter range on typical hardware.
- A single layer doesn't fit? → Add tensor parallelism within the node. When attention heads or FFN layers are too large, TP splits them across NVLink-connected GPUs.
- Very large model (tens of billions)? → 3D parallelism: TP within nodes, PP across node groups, FSDP across pipeline replicas.
Here's that decision tree as a function you can actually call:
def recommend_strategy(
param_billions: float,
gpu_mem_gb: float = 80.0,
num_gpus: int = 8,
interconnect: str = "nvlink", # "nvlink", "infiniband", or "pcie"
) -> dict:
"""Recommend a distributed training strategy based on hardware constraints.
Returns a dict with 'strategy', 'reason', and 'config_tips'.
"""
# Static training memory: params(fp16) + grads(fp16) + adam(fp32)
# = 2B + 2B + 8B = 12 bytes per parameter
static_mem_gb = param_billions * 12
# Activations typically need 3-5× the static memory for realistic
# batch sizes, so we use conservative thresholds for GPU headroom.
fits_one_gpu = static_mem_gb < gpu_mem_gb * 0.15
fits_with_ckpt = static_mem_gb < gpu_mem_gb * 0.25
# Per-layer memory: rough estimate — largest layer is ~4× param/num_layers
# Assume 32 layers for simplicity
largest_layer_gb = (param_billions * 2 * 4) / 32
if fits_one_gpu:
return {
"strategy": "DDP",
"reason": f"Static memory ~{static_mem_gb:.0f} GB fits in "
f"{gpu_mem_gb:.0f} GB with room for activations. "
f"DDP gives best throughput.",
"config_tips": "torchrun --nproc_per_node=N, batch_size × N",
}
if fits_with_ckpt:
return {
"strategy": "DDP + Activation Checkpointing",
"reason": f"Tight fit ({static_mem_gb:.0f} GB static). "
f"Checkpointing frees activation memory at "
f"~33% compute cost.",
"config_tips": "torch.utils.checkpoint.checkpoint() on each block",
}
sharded_mem = static_mem_gb / num_gpus
if largest_layer_gb < gpu_mem_gb * 0.5 and sharded_mem < gpu_mem_gb * 0.7:
return {
"strategy": "FSDP2 (ZeRO-3)",
"reason": f"Static memory {static_mem_gb:.0f} GB total, "
f"~{sharded_mem:.1f} GB/GPU after sharding across "
f"{num_gpus} GPUs.",
"config_tips": "fully_shard() each transformer block, then model",
}
if interconnect == "nvlink":
return {
"strategy": "TP (intra-node) + FSDP (across replicas)",
"reason": f"Too large for FSDP alone ({sharded_mem:.0f} GB/GPU "
f"after sharding). TP splits layers across NVLink GPUs.",
"config_tips": "DeviceMesh([tp_dim, dp_dim]), TP=4 + FSDP=2",
}
return {
"strategy": "3D Parallelism (TP + PP + FSDP)",
"reason": f"Large model ({param_billions}B) on {interconnect}. "
f"Full 3D parallelism needed.",
"config_tips": "TP within node, PP across nodes, FSDP for replicas",
}
# Test it
for size in [0.025, 0.125, 1.3, 7.0, 70.0]:
r = recommend_strategy(size)
print(f"\n{size}B params → {r['strategy']}")
print(f" Reason: {r['reason']}")
print(f" Config: {r['config_tips']}")
# Output:
# 0.025B params → DDP
# Reason: Static memory ~0 GB fits in 80 GB with room for activations. ...
# Config: torchrun --nproc_per_node=N, batch_size × N
#
# 0.125B params → DDP
# Reason: Static memory ~2 GB fits in 80 GB with room for activations. ...
# Config: torchrun --nproc_per_node=N, batch_size × N
#
# 1.3B params → DDP + Activation Checkpointing
# Reason: Tight fit (16 GB static). Checkpointing frees activation memory ...
# Config: torch.utils.checkpoint.checkpoint() on each block
#
# 7.0B params → FSDP2 (ZeRO-3)
# Reason: Static memory 84 GB total, ~10.5 GB/GPU after sharding ...
# Config: fully_shard() each transformer block, then model
#
# 70.0B params → TP (intra-node) + FSDP (across replicas)
# Reason: Too large for FSDP alone (105 GB/GPU after sharding). ...
# Config: DeviceMesh([tp_dim, dp_dim]), TP=4 + FSDP=2
One final nuance: hardware determines what's viable. NVLink at 600–900 GB/s makes tensor parallelism practical within a node. InfiniBand at ~50 GB/s makes pipeline parallelism and FSDP viable across nodes. PCIe at ~64 GB/s limits everything — if your GPUs are PCIe-connected, DDP is your only efficient option, and you should invest in better interconnect before investing in more GPUs.
Scaling batch size linearly with GPU count changes the effective learning rate. If you quadruple the batch size via 4-GPU DDP, consider using linear LR scaling or warmup to maintain convergence. See Learning Rate Schedules from Scratch for the details.
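A minimal sketch of that adjustment — the linear rule from Goyal et al.'s large-minibatch SGD work, plus the square-root variant sometimes preferred with Adam (treat both as starting points, not guarantees):

```python
def scaled_lr(base_lr: float, base_batch: int, global_batch: int,
              rule: str = "linear") -> float:
    """Scale the learning rate with the global batch size.
    'linear': lr * k, 'sqrt': lr * sqrt(k), where k is the batch ratio."""
    k = global_batch / base_batch
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# 4-GPU DDP with the same per-GPU batch quadruples the global batch:
print(scaled_lr(0.1, 256, 1024))                  # 0.4 (linear)
print(scaled_lr(0.001, 256, 1024, rule="sqrt"))   # 0.002 (sqrt)
```

Pair either rule with a few epochs of warmup, as the text above suggests, before judging convergence.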
References & Further Reading
- Huang, Cheng, Bapna, Firat et al. — "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism" — NeurIPS 2019. The foundational pipeline parallelism paper with synchronous micro-batching.
- Narayanan, Harlap, Phanishayee et al. — "PipeDream: Generalized Pipeline Parallelism for DNN Training" — SOSP 2019. Introduced 1F1B scheduling to reduce pipeline memory overhead.
- Shoeybi, Patwary, Puri et al. — "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" — arXiv 2020. The blueprint for tensor parallelism in transformers.
- Rajbhandari, Rasley, Ruwase, He — "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" — SC 2020. The three-stage sharding strategy that FSDP builds on.
- Narayanan et al. — "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" — SC 2021. Combines TP, PP, and DP into practical 3D parallelism.
- Li et al. — "PyTorch Distributed: Experiences on Accelerating Data Parallel Training" — VLDB 2020. The engineering behind PyTorch DDP.
- PyTorch TorchTitan — Reference implementation for 3D parallelism with FSDP2, TP, and PP on Llama models.
Related DadOps Posts
- GPU Memory Benchmarks: Will This Model Fit? — The memory anatomy that motivates distribution
- Serving LLMs at Scale — Distribution strategies for inference (vs. training)
- Profiling Python AI Code — Measuring communication overhead in practice
- Quantization from Scratch — Alternative: make the model smaller instead of distributing
- LoRA from Scratch — Alternative: train fewer parameters to reduce memory
- Load Testing AI APIs — Benchmarking the serving side of distributed models
- Learning Rate Schedules from Scratch — Adjusting LR for scaled batch sizes in DDP