Distributed Training in 2026: DDP, FSDP2, DeepSpeed, Megatron, and the Five Axes of Parallelism

Once you have more than one GPU in a box, you have a choice to make. Four open-source stacks dominate distributed training today — PyTorch DDP, PyTorch FSDP2, Microsoft DeepSpeed, and NVIDIA Megatron-Core — joined by TorchTitan as the new reference for native-PyTorch large-scale training. Each one solves a different problem. Picking the wrong one wastes weeks: either the model does not fit and you find out at step 30 with an out-of-memory crash, or you over-engineered for a workload DDP would have handled in a tenth the engineering time.

This article is the buyer-and-architect view of distributed training. It walks the five parallelism axes, the four (now five) frameworks that combine them, and the recipes that actually work on Kentino-class hardware — primarily 4× and 8× RTX 5090, 4090, and RTX Pro 6000 Blackwell on AMD EPYC hosts, with PCIe Gen5 and commodity Ethernet or InfiniBand between nodes. The honest punchline is at the top so the rest of the article does not need to keep proving it: most Kentino customers fine-tune, not pretrain, and FSDP2 on a single 8-GPU node handles 90% of fine-tuning needs. Megatron-Core and 3D parallelism only earn their keep when you are pretraining from scratch above the 30B scale, and that is a much smaller club than the marketing implies.

The five axes

Every modern distributed-training framework composes some subset of the same five parallelism axes. They are not alternatives — large-model recipes combine three or four at once. Read down the communication column; that is the entire game.

Axis What it splits Communication per step Bandwidth-hungry? Notes
Data parallel (DP) The batch — every rank has full model All-reduce of gradients once per step Moderate The baseline; DDP and FSDP both DP
Tensor parallel (TP) Each layer's matmul across GPUs All-reduce per layer × 2 (attn + FFN) Very heavy Loves NVLink; suffers on PCIe
Pipeline parallel (PP) Layers split across stages Activation at stage boundary Light Adds bubble latency, raises throughput
Sequence/context parallel (SP/CP) Sequence dimension across GPUs Ring all-gather of KV / activations Moderate–heavy Enables million-token training
Expert parallel (EP) MoE experts across GPUs All-to-all per MoE layer Bursty heavy MoE only

DP shouts once per step. TP shouts twice per layer. PP whispers once per stage. SP/CP shouts once per attention block but along a different dimension. EP does all-to-all only on the MoE-active layers. That single column decides where each axis lives — inside a fast box, or across a slower fabric.

The "3D parallelism" that comes up in every NVIDIA blog is DP × TP × PP. Add SP and EP and it becomes 4D or 5D, which is a notation game more than an architectural change. What matters is the layout: TP inside the node (where bandwidth is cheap), PP between nodes (where bandwidth is dear), DP on top to scale throughput. SP slots in alongside TP; EP only applies to MoE.

Cross-references: tensor-parallel intra-node bandwidth is unpacked in K07, the NVLink-versus-PCIe penalty in N03, and the inter-node fabric in N08.

DDP — when the model fits and you just want more throughput

PyTorch DDP (DistributedDataParallel) is the oldest, simplest, and still-correct answer when the model fits on one GPU. Every rank holds a full copy of the model and optimizer state. Each step, every rank does its own forward, backward, and gradient computation on its own micro-batch. Then a single all-reduce sums the gradients across ranks and everyone applies the same update.

DDP in PyTorch 2.x is the same DDP it has always been, with two improvements worth noting: a more robust static_graph=True path for compiler-friendly graphs, and tighter integration with torch.compile. The communication pattern is unchanged — one all-reduce per step, overlap with backward, scale linearly until the all-reduce dominates.

When DDP is the right answer:

  • The full model + optimizer states + activations fit on one GPU. For Adam-style optimizers on FP32 master weights, the rule of thumb is roughly 16 bytes per parameter of memory for the trainable state alone (4 bytes weights, 4 bytes gradients, 8 bytes optimizer) plus activations. An 8B model = 128 GB of state, which fits on one RTX Pro 6000 Blackwell (96 GB) only with mixed precision and BF16 master weights. Below 8B parameters DDP is comfortable on a single 5090; above it, look at FSDP2.
  • You want maximum throughput, not maximum model size. DDP has the lowest communication-to-compute ratio of any data-parallel approach, because gradients are reduced once per step, after all the local backward work.
  • You are doing reinforcement-learning rollouts, LoRA, or any setup where each rank holds a small trainable adapter on top of frozen base weights.

When DDP is wrong: the model does not fit. The fix is not "smaller batch." The fix is FSDP2.

torchrun --standalone --nproc-per-node=8 train.py

That is the whole launcher. DDP is the boring default everyone forgets about, and for the right workload it is the fastest thing you can run.

FSDP2 — the new default for everything DDP can't handle

FSDP (Fully Sharded Data Parallel) shards the model parameters, gradients, and optimizer states across the data-parallel group. Each GPU stores 1/N of the parameters. To do a forward pass, the rank all-gathers the layer it currently needs, runs the compute, and discards the gathered weights. Memory drops roughly N-fold compared to DDP; communication rises because every layer is gathered and re-gathered.

The 2025 story is FSDP2. The original FSDP (FSDP1) wrapped groups of parameters into a single FlatParameter, which made reasoning about per-parameter behavior — partial freezing, mixed dtypes, parameter-wise optimizer settings — painful or impossible. FSDP2 rewrote the internals on top of DTensor: every tensor remains a real torch.Tensor that happens to be sharded along its dim-0 across ranks. The user-facing API changed from FSDP(model, ...) to fully_shard(model, ...).

What FSDP2 actually delivers over FSDP1, from the published benchmarks and our own runs:

  • ~7% lower GPU memory on Llama 2 7B at the same configuration, because FSDP2 avoids the record_stream pattern that pinned memory pessimistically.
  • ~1.5% throughput gain at parity, and up to 50% throughput speedup when combined with torch.compile and torchao float8 training on Hopper-class hardware.
  • Sharded state dicts that load fast and re-shard cleanly across different parallelism layouts — the FSDP1 checkpoint format was famously hard to reshape between training and inference.
  • Partial parameter freezing without acrobatics, which matters for LoRA and adapter training.

FSDP1 is deprecated as of PyTorch 2.11. New work should use fully_shard. The old FullyShardedDataParallel wrapper still exists for compatibility but is on a removal track.

FSDP2 also exposes two sharding strategies that decide where the model lives:

  • Full shard. Parameters fully sharded across all ranks (the FSDP1 default). Lowest memory, highest communication.
  • Hybrid shard. Parameters sharded within a node and replicated across nodes. Inside a node, communication is over PCIe/NVLink (fast). Between nodes, only the gradient all-reduce crosses the slow fabric. This is the sweet spot for 2–4 nodes on 100/200 Gbps Ethernet/IB.

When FSDP2 is the right answer (the modal Kentino case):

  • 8B–70B fine-tuning on one 8-GPU node. Full shard, BF16, gradient checkpointing on, torch.compile. Done.
  • 70B–405B fine-tuning across 2–4 nodes. Hybrid shard with full-shard inside each node, replicated across.
  • Anything LoRA / QLoRA — FSDP2's partial-parameter handling beats FSDP1 here outright.
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for block in model.transformer_blocks:
    fully_shard(block, mp_policy=mp)
fully_shard(model, mp_policy=mp)

Launched the same way as DDP, via torchrun. The wrapper is the only thing that changes.

DeepSpeed and the ZeRO levels — still around, no longer the default

DeepSpeed is Microsoft's distributed training stack. Its claim to fame was ZeRO (Zero Redundancy Optimizer), which arrived years before FSDP and defined the modern shard-everything approach.

Level What is sharded Memory savings vs DDP Communication vs DDP
ZeRO-1 Optimizer states ~4× Same
ZeRO-2 Optimizer states + gradients ~8× Slightly more
ZeRO-3 Optimizer states + gradients + parameters Linear in N ~1.5× DDP

ZeRO-3 is architecturally equivalent to FSDP full-shard. They solve the same problem with the same communication primitives.

The reality in 2025–2026: FSDP2 has eaten DeepSpeed's lunch on the dense-LLM use case. PyTorch native, no extra package, integrated with torch.compile, the same recipe ports across Hugging Face Transformers, Accelerate, and TorchTitan. Internal benchmarks from Lightning and Hugging Face show FSDP full-shard running 2–5× faster per iteration than ZeRO-3 in some setups, although DeepSpeed claws back ground on very large models (10B+) where its CPU-offload and NVMe-offload paths are still genuinely useful.

DeepSpeed remains worth knowing about for three reasons:

  1. ZeRO-Infinity offload. If you must fine-tune a model that does not fit in aggregate GPU memory at all — say a 405B base on a 4× RTX Pro 6000 Blackwell box — DeepSpeed can offload parameters to CPU RAM (cheap, slow) and to NVMe (cheaper, much slower). FSDP has CPU offload too, but DeepSpeed's NVMe path is more mature. Useful for "I have one machine and a stubborn model"; not the right answer if you can rent or buy a second node.
  2. DeepSpeed-Ulysses sequence parallelism. A communication-efficient sequence-parallel scheme that uses all-to-all instead of ring all-gather for attention. Demonstrated up to 1M-token contexts on 64 A100s, and training Llama-8B at 15M tokens of context on 32 H100s in published 2025 work. If you are specifically chasing very long context training, Ulysses is still ahead of the Megatron context-parallel implementation for some shapes.
  3. DeepSpeed-MoE. Mixture-of-experts training with expert parallel. Less relevant for fine-tuning, very relevant if you are pretraining MoE.

For the bulk of Kentino-customer fine-tuning, the right call is FSDP2 unless you have a specific reason to reach for DeepSpeed (long context, CPU/NVMe offload, MoE pretraining). The ecosystem momentum is unambiguously toward FSDP2.

Megatron-LM, Megatron-Core, NeMo — where the heavy iron lives

Megatron started as NVIDIA's tensor-parallel Transformer paper in 2019. Today the family has three layers:

  • Megatron-LM — the original research codebase. Still used; still updated.
  • Megatron-Core — the modular library version. Composable building blocks for TP/PP/DP/EP/CP, mixed precision (FP16/BF16/FP8/FP4), and reference transformer architectures. The thing you actually integrate.
  • NVIDIA NeMo — the end-to-end framework built on Megatron-Core. Recipes, data pipelines, alignment, deployment.

Megatron-Core is the framework that wins at the very top end, specifically because it implements tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism in one composable mesh. When you are training a 405B dense model on 512+ GPUs, you cannot avoid combining at least three of those, and Megatron-Core is the most-deployed, most-tested combination.

The Megatron parallelism guidance for 2026, from NVIDIA's own documentation, lines up with the hardware reality:

Hardware Recommended primary axes
Single node, NVLink TP up to 8 within the node
Multiple nodes, InfiniBand NDR TP within node, PP across nodes
Limited network (Ethernet) Minimize TP, prefer PP for cross-node
Long sequences Add CP; enable SP wherever TP is on

That table tells you why Megatron exists. It is the framework whose authors live with the bandwidth math we describe in N03 and K07, and its recipes are tuned for it.

Where Megatron is overkill: any single-node fine-tune, any model that fits in FSDP2-friendly memory, any workload where you do not need TP. Megatron's TP machinery is genuinely faster than DIY TP on PCIe, but TP on PCIe is still slow — see the K07 numbers. Megatron's strength is on SXM hardware that Kentino does not build.

Where Megatron is the right answer: pretraining 70B–405B+ dense models on rented or owned NVLink-class hardware, or building production training infrastructure for a research lab. If that is you, you are probably already in the NeMo ecosystem.

TorchTitan — the new reference

TorchTitan is Meta's PyTorch-native large-scale training reference, accepted at ICLR 2025 and now the de facto example for "what should a TP × PP × FSDP2 × CP recipe look like in 2026?" It does not invent new parallelism — it composes the building blocks PyTorch already ships (fully_shard, torch.distributed.tensor.parallel, pipelining, DTensor) into a clean, four-dimensional parallel training script with async sharded checkpointing, torch.compile, and float8.

Why it matters even if you do not use it directly:

  • It is the canonical example of how FSDP2 composes with TP and PP without a third-party framework.
  • The same APIs ship in stock PyTorch. There is nothing magic.
  • AMD shipped an optimized TorchTitan fork for ROCm in late 2025; the Lightning AI partnership announced in October 2025 makes TorchTitan recipes runnable on Lightning Studios.

For Kentino-class customers, TorchTitan is a reference to read more than a framework to deploy. If you are fine-tuning, Accelerate or Axolotl over FSDP2 is more ergonomic. If you are pretraining at modest scale (8–64 GPUs) on commodity hardware, TorchTitan is competitive with NeMo and a lot less heavy operationally.

The framework matrix

Framework Parallelism axes Maintained? Best for
PyTorch DDP DP Yes, stable Model fits per-GPU; max throughput
PyTorch FSDP1 DP (sharded) Deprecated 2.11 Don't start here
PyTorch FSDP2 DP (sharded), composes with TP/PP/CP Yes, active Modal fine-tuning answer in 2026
DeepSpeed ZeRO DP (sharded), CPU/NVMe offload Yes, active Offload-heavy, very long context (Ulysses), MoE
Megatron-Core / NeMo TP, PP, SP, CP, EP, DP Yes, very active 70B+ pretraining, SXM/NVLink clusters
TorchTitan FSDP2 + TP + PP + CP + float8 Yes, reference Modern pretraining on PyTorch-native stack
HF Accelerate Wrapper around DDP/FSDP/DS Yes, active Easy launcher, abstracts the backend
Axolotl Wrapper around Accelerate/FSDP Yes, active Fine-tuning, datasets, recipes for LoRA/QLoRA

Accelerate and Axolotl are not separate parallelism strategies — they wrap the backends above. Most Kentino fine-tuning customers will end up using Axolotl over FSDP2 without thinking about it, and that is correct.

The communication budget — why your network limits the model

Cross-reference: N08 walks through the RDMA and uplink math; this is the training-specific view.

For data parallel (DDP or FSDP), the per-step inter-rank communication is roughly proportional to the parameter count (gradients to reduce). For tensor parallel, it is proportional to the activation size × layer count — orders of magnitude larger per step.

Rule-of-thumb gradient volume for a single step at BF16:

Model size Gradient bytes/step All-reduce time at 25 GB/s (NDR HDR-class) At 12.5 GB/s (100 GbE)
8B 16 GB ~0.6 s ~1.3 s
70B 140 GB ~5.6 s ~11 s
405B 810 GB ~32 s ~65 s

The 70B all-reduce alone is 5.6 seconds on a 200 Gbps fabric. If your forward+backward compute on one step is also 5 seconds, you are already 50% communication-bound; if compute is 2 seconds, you are 70%+ idle on the network. This is why 100 GbE chokes 70B+ training and you need 400 GbE / NDR IB. Compute keeps getting faster; network has to keep pace or the GPUs idle.

FSDP2 hides some of this with overlap (start gathering the next layer while computing the current). Hybrid shard hides more by keeping the gradient all-reduce inside the fast intra-node fabric and only reducing replicated gradients across nodes. The numbers above are worst-case for full-shard FSDP across nodes without overlap.

For tensor parallel, the per-token communication scales with hidden size and layer count. The numbers in K07 show why: ~80 MB per generated token during decode, ~300 GB per prefill on 70B at batch 32. This is the regime where PCIe (50 GB/s realistic) is roughly 14× slower than NVLink (700+ GB/s realistic), and where pretraining a 70B+ dense model on PCIe simply does not work at acceptable efficiency.

Real recipes

8B Llama fine-tune, one 8-GPU node — FSDP2

Hardware: 8× RTX 5090 or 8× RTX Pro 6000 Blackwell, EPYC host, no inter-node fabric needed.

torchrun --standalone --nproc-per-node=8 \
  finetune.py \
    --model meta-llama/Llama-3.1-8B \
    --batch-size 1 --grad-accum 16 --seq-len 4096 \
    --bf16 --fsdp full_shard --fsdp-reshard-after-forward \
    --gradient-checkpointing --torch-compile

Expected: ~2000 tok/s aggregate at BF16, fits comfortably with KV/activations, full-shard FSDP2 over PCIe Gen5 handles the gradient all-reduce inside the box. This is also what 90% of Kentino fine-tuning jobs look like.

70B Llama fine-tune, 2× 8-GPU nodes — FSDP2 hybrid shard

Hardware: 2× 8× RTX Pro 6000 Blackwell, 200 Gbps IB or 100 GbE RoCE between nodes.

torchrun --nnodes=2 --nproc-per-node=8 \
  --rdzv-backend=c10d --rdzv-endpoint=node0:29500 \
  finetune.py \
    --model meta-llama/Llama-3.3-70B \
    --batch-size 1 --grad-accum 32 --seq-len 4096 \
    --bf16 --fsdp hybrid_shard \
    --gradient-checkpointing --torch-compile \
    --activation-cpu-offload

Hybrid shard keeps full-shard inside each 8-GPU node and replicates across the 2 nodes. The inter-node fabric only carries the reduce-scatter of replicated gradients — roughly 70 GB/step at BF16, ~3 s on 200 Gbps IB. Overlap hides most of it. At 100 GbE the same step costs ~5–6 s and the GPUs start showing idle time; the recipe still works, just slower per epoch.

405B pretraining, 8× 8-GPU nodes — Megatron-Core 3D

Hardware: 8× 8-GPU NVLink/SXM nodes, 400 Gbps NDR IB. Not Kentino-built — for completeness.

# Megatron-Core launcher (abbreviated)
torchrun --nnodes=8 --nproc-per-node=8 \
  pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --sequence-parallel \
    --context-parallel-size 1 \
    --num-layers 126 --hidden-size 16384 --num-attention-heads 128 \
    --seq-length 8192 --micro-batch-size 1 --global-batch-size 2048 \
    --bf16 --use-flash-attn --transformer-impl transformer_engine

TP=8 × PP=8 = 64 ranks per replica; 8 nodes × 8 GPUs = 64 GPUs total, exactly one replica. To scale to multiple replicas for higher throughput, multiply by DP. This is where Megatron earns its keep. The same model on FSDP2 alone would spend most of its time on inter-rank communication; the 3D layout puts the heaviest communication (TP) inside the NVLink domain.

Honest take for Kentino customers

Most Kentino customers are not pretraining. They are fine-tuning open-weight base models — Llama, Qwen, Mistral, sometimes Gemma — on their domain data, occasionally with LoRA or QLoRA, occasionally full fine-tunes. For that work, the framework choice is:

  • One model that fits on one GPU at BF16, with optimizer state. Use DDP. Replicate copies for throughput.
  • One model that does not fit on one GPU but fits in aggregate on one 8-GPU node. Use FSDP2 full shard. This covers 8B–70B fine-tunes.
  • One model that fits in aggregate across two 8-GPU nodes. Use FSDP2 hybrid shard. 70B–200B fine-tunes.
  • A model larger than that, or pretraining from scratch. This is the conversation where Megatron-Core, SXM hardware, and NDR IB belong. We will build the storage and the management plane, but the GPUs are usually rented for that phase.

The 10% of customers who genuinely need multi-node, tightly-coupled training are mostly research labs with funded pretraining programs, and they know what they need before they call. The other 90% should not be sold a cluster — they should be sold one well-spec'd 8-GPU node with the right fabric stub for a future second node, and the network gear left as a P2 line item.

What to do next

If you are sizing a training job and have not picked a framework yet, walk these in order:

  1. Compute the per-rank memory at your target precision. Parameters × 2 (BF16) + gradients × 2 + optimizer × 8 (Adam FP32 states) + activations. If that fits in your GPU's VRAM with headroom, DDP is your answer.
  2. If memory does not fit, ask: do I have one node or many? One node → FSDP2 full shard, end of conversation. Many nodes → FSDP2 hybrid shard.
  3. Run a single-step sanity check. Watch the all-reduce time in NCCL_DEBUG=INFO. If it dominates the step, the network is undersized for the model. Going to a smaller model or a bigger fabric are your two options; tuning the framework will not save you.
  4. Only reach for Megatron-Core or DeepSpeed when FSDP2 cannot do what you need. "Cannot" means: need tensor parallel for a model bigger than your aggregate node memory, need CPU/NVMe offload, need sequence-parallel above what --context-parallel-size in PyTorch gives you, or are pretraining MoE.
  5. Use Axolotl or Accelerate for the launcher. Hand-rolling FSDP2 wrapping is a learning exercise; in production you want the framework that handles the dataset, the tokenizer, the checkpoint format, and the LoRA plumbing. Both ride on FSDP2 underneath.
  6. Checkpoint with torch.distributed.checkpoint (DCP) or NeMo's async sharded checkpoint. Synchronous unsharded checkpoints written to NFS are a 2022 pattern; in 2026 they are a self-inflicted training stall. See K06 for the failure handling that hangs off this.
  7. Be honest about the size of the cluster you actually need. If your job runs on one 8-GPU node in a reasonable wall-clock, do not pay for a four-node cluster to make it 2.5× faster on paper. The scaling math in K07 shows why "more nodes" stops helping early on commodity hardware.

Companion articles: distributed training storage and checkpointing in K04, inference clustering in K03, failure handling in K06, the PCIe bandwidth ceiling in K07, and the NVLink-versus-PCIe trade-off in N03. The networking math is in N08.


This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.