Failure Handling in AI Clusters: What Actually Breaks and How to Recover

Distributed training is the one workload where hardware failures are not a rare nuisance — they are an operational tax you pay continuously. Meta's published Llama 3.1 405B post-mortem records 419 unexpected interruptions over 54 days on a 16,384-GPU cluster: one event every three hours, with GPU and HBM faults responsible for roughly half. That is the steady-state experience of running thousands of GPUs hard.

Most Kentino customers will never see numbers like that. Single-node 8-GPU fine-tunes are statistically quiet. But the failure modes, diagnostic tools, and recovery patterns are the same. This article is the honest catalogue: what fails, how you notice, what you do about it, and where the engineering effort is actually worth it at our scale.

The actual failure modes

Two categories matter: hardware events (a GPU, PSU, NIC, or disk does something physical) and software events (CUDA, NCCL, the framework, or the OS reacts badly to a transient). The list below is in rough order of how often each one bites on a multi-GPU workstation or small cluster.

GPU XID errors

The kernel logs (dmesg, journalctl -k) are the source of truth. NVIDIA emits an XID line on any GPU fault. The ones you actually see:

XID	Meaning	What it really means
13	Graphics Engine Exception	App bug, usually an out-of-bounds or illegal memory access (a CUDA OOM surfaces as a framework error, not an XID)
31	GPU memory page fault	App bug or driver issue, occasionally bad VRAM
43	Stopped processing	App-side problem, GPU is fine
48	Double-bit ECC error	Hardware. Memory cell is gone, GPU should be retired
63	ECC page retirement / row remapping pending	Hardware degrading. Schedule replacement
74	NVLink error	Cable, riser, or board fault
79	GPU has fallen off the bus	Power, PCIe, riser, or thermal kill
92	High single-bit ECC error rate	Hardware degrading
94	Contained ECC error (Hopper-class)	Single workload killed, GPU keeps running
119	GSP RPC timeout	Driver/firmware issue, often resolved by a reboot

Two notes from experience:

XID 79 is the one customers panic-call about. "The GPU disappeared." On a 4× or 8× riser build, XID 79 is almost always a PCIe riser problem, a power connector that backed out under thermal cycling, or a thermal shutdown — not a dead GPU. Re-seat, re-cable, retest before RMA.
XID 48 and 63 are real. ECC defects creep up over months on heavily-used cards. The GPU retires pages automatically until it runs out of spare rows. After that the card is unsafe for training; most operators replace it.
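
As an illustration of turning the table above into something a log watcher can act on, here is a minimal sketch that extracts the XID number from a kernel log line and maps it to a severity label. The regular expression is modelled on the NVRM line format quoted later in this article. The label names and the "ticket" assignment for XIDs 74, 94, and 119 are assumptions for the example, not NVIDIA or Kentino policy.

```python
import re
from typing import Optional, Tuple

# Severity labels are illustrative. 48/63/79/92 follow the "page" rule in the
# alert list of this article; the "ticket" and "app" buckets are assumptions.
XID_SEVERITY = {
    13: "app", 31: "app", 43: "app",
    48: "page", 63: "page", 79: "page", 92: "page",
    74: "ticket", 94: "ticket", 119: "ticket",
}

# Matches kernel lines such as:
#   NVRM: Xid (PCI:0000:c1:00): 79, pid='<unknown>', ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def parse_xid(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (pci_address, xid, severity), or None if the line has no XID."""
    match = XID_RE.search(line)
    if match is None:
        return None
    xid = int(match.group(2))
    return match.group(1), xid, XID_SEVERITY.get(xid, "unknown")
```

One way to use it is to feed it lines from journalctl -k -f and forward anything labelled "page" to whatever alerting channel is already in place.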

CUDA out-of-memory mid-run

The single most common training failure on our hardware, and almost always the operator's fault, not the hardware's. Typical pattern: training runs fine for 200 steps, then crashes with CUDA error: out of memory. The cause is usually:

Activation memory grows with sequence length — longer samples later in the epoch blow the budget.
A peer process on the same GPU. nvidia-smi shows two PIDs; one was supposed to be killed and was not.
Memory fragmentation. PyTorch's caching allocator refuses a large contiguous block even with adequate free memory. Fix: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

Doubling the GPU does not help. The right answer is reducing micro-batch size, enabling gradient checkpointing, or sharding the optimizer (see K02).
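
Reducing the micro-batch size usually goes together with gradient accumulation so that the effective global batch stays the same. The sketch below shows only that arithmetic; the function name and the batch sizes in the usage note are hypothetical, not values from a Kentino configuration.

```python
def accumulation_steps(global_batch: int, micro_batch: int, world_size: int) -> int:
    """Gradient-accumulation steps needed to keep the global batch size fixed.

    global_batch: samples per optimizer step across the whole job
    micro_batch:  samples per GPU per forward/backward pass
    world_size:   number of GPUs (data-parallel ranks)
    """
    per_pass = micro_batch * world_size
    if per_pass <= 0 or global_batch % per_pass != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of "
            f"micro_batch * world_size = {per_pass}"
        )
    return global_batch // per_pass
```

For example, with a global batch of 256 on 8 GPUs, halving the micro-batch from 8 to 4 doubles the accumulation steps from 4 to 8 and leaves the optimizer's view of the batch unchanged.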

NCCL timeouts and network glitches

NCCL collectives (all_reduce, all_gather, reduce_scatter) are synchronous across all ranks. If one rank stalls (bad NIC, congested switch, kernel scheduler hiccup, a single slow GPU), every other rank waits at the next collective and the whole job blocks. Without async error handling the stall is silent; the job appears to "hang" until a watchdog timeout fires (historically 30 minutes by default; recent PyTorch releases default to 10 minutes for NCCL).

The fix is one environment variable, plus an optional second one that keeps a trace for post-mortem analysis:

export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000  # for post-mortem analysis

PyTorch then aborts with SIGABRT on a timed-out collective. Combined with torchelastic, the job restarts from the last checkpoint. Note: the older NCCL_ASYNC_ERROR_HANDLING is deprecated.

Node disconnect and PSU failures

On multi-node clusters a node can drop from a NIC reset, switch port flap, kernel panic, OOM-killer hit, or power event. Detection is the same as NCCL timeout. For single-node 8-GPU builds — most of our customers — this category does not apply.

A dead PSU is a hard fail. On an 8-GPU server with dual ATX PSUs in split-rail configuration, having two PSUs does not equal redundancy. PSU A powers GPUs 0–3, PSU B powers GPUs 4–7. Lose PSU B and four GPUs vanish to XID 79 within milliseconds. Recovery means physical replacement. Real redundancy requires CRPS units in 1+1 hot-swap, which is a server-class build, not a consumer-GPU workstation.

Storage write failures during checkpoint

Less common but painful. The job runs ten hours, hits a checkpoint, and the write fails because the NFS server is full, the local NVMe is over its DWPD allocation and entering read-only mode, an inode limit was hit, or permissions changed. Damage is proportional to checkpoint interval; if you only catch it on the next attempt you can lose an hour or more.
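
One common way to limit the damage is to check free space before writing and to write the checkpoint atomically, so a failed write never overwrites the last good file. The sketch below uses only the standard library and takes the serialized checkpoint as bytes; in a real training script the same pattern would wrap whatever save call the framework provides. The function name and the headroom factor of 2 are assumptions for illustration.

```python
import os
import shutil
import tempfile


def write_checkpoint_atomic(payload: bytes, final_path: str, headroom: float = 2.0) -> None:
    """Write payload to final_path without ever leaving a truncated file there."""
    directory = os.path.dirname(os.path.abspath(final_path))

    # Fail early and loudly if the volume is nearly full.
    free = shutil.disk_usage(directory).free
    if free < len(payload) * headroom:
        raise OSError(f"only {free} bytes free in {directory}, refusing to checkpoint")

    # Write to a temporary file on the same filesystem, then rename over the
    # target. os.replace is atomic when source and target share a filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temporary file; the previous checkpoint survives.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The point of the design is that a write error raises inside the training loop at the moment it happens, while the previous checkpoint stays intact on disk.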

Slow ECC leaks

The quiet killer. A GPU starts emitting XID 92 (single-bit ECC) once a week, then daily, then hourly. Each event is "contained" and the job keeps running, but accuracy drifts and training loss develops a slow upward bias. By the time anyone notices, hundreds of pages are retired and the card is heading for XID 48. This is why the monitoring section matters more than the recovery section.

Detection — what to watch

Three layers: kernel-level (XID lines in dmesg), GPU vendor (DCGM exporter to Prometheus, dcgmi diag for active checks), and framework-level (PyTorch watchdog, NCCL trace buffer, loss/throughput alerts).

The alerts that matter:

Any XID 48 / 63 / 79 / 92 line in dmesg → page
GPU temp > 85 °C for more than 5 minutes → page
ECC volatile error count incrementing → ticket, not page
DCGM dcgm_thermal_violation non-zero → cooling problem, check airflow
Training loss not decreasing for >100 steps → worth a look
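
The last alert in the list is the only one that has to live inside the training code. A minimal sketch of such a check, assuming the loss is available as a plain float each step; the class name and the min_delta parameter are hypothetical, and the default patience of 100 mirrors the threshold above.

```python
class PlateauDetector:
    """Flags when the loss has not improved for more than `patience` steps."""

    def __init__(self, patience: int = 100, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta      # improvement smaller than this is ignored
        self.best = float("inf")
        self.steps_since_best = 0

    def update(self, loss: float) -> bool:
        """Record one step's loss; return True once the plateau exceeds patience."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.steps_since_best = 0
        else:
            # NaN losses also land here, because NaN comparisons are False.
            self.steps_since_best += 1
        return self.steps_since_best > self.patience
```

In a noisy fine-tune it probably makes sense to feed this a moving average of the loss rather than the raw per-step value, otherwise the counter resets on every lucky batch.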

Real failure log examples

What you actually see in journalctl -k when XID 79 fires (RTX 5090, riser cable backed out under thermal cycling):

Apr 22 03:14:17 kentino-ai-04 kernel: NVRM: Xid (PCI:0000:c1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Apr 22 03:14:17 kentino-ai-04 kernel: NVRM: GPU 0000:c1:00.0: GPU is on Board .
Apr 22 03:14:17 kentino-ai-04 kernel: NVRM: A GPU crash dump has been created. If possible, please run
Apr 22 03:14:17 kentino-ai-04 kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Apr 22 03:14:17 kentino-ai-04 kernel: NVRM: the NVIDIA kernel module is unloaded.

What an NCCL timeout looks like, abbreviated:

[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout:
WorkNCCL(SeqNum=842, OpType=ALLREDUCE, NumelIn=268435456, NumelOut=268435456,
Timeout(ms)=1800000) ran for 1800321 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] [Rank 3] Some NCCL operations have failed or timed out.
terminate called after throwing an instance of 'std::runtime_error'

Rank 3 is reporting it, but rank 3 is almost never the actual problem. The slow rank is the one that did not arrive. The NCCL trace buffer dumps which ranks were blocked at which call — that is how you find the real culprit.
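
When several ranks log a timeout at once, pulling the fields out of each watchdog message makes them easier to compare. The sketch below parses the message format quoted above; that format differs between PyTorch versions, so the regular expression is an assumption tied to this specific excerpt, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern modelled on the log excerpt in this article; other PyTorch
# versions word the watchdog message differently.
WATCHDOG_RE = re.compile(
    r"\[Rank (\d+)\] Watchdog caught collective operation timeout:\s*"
    r"WorkNCCL\(SeqNum=(\d+), OpType=(\w+),.*?"
    r"Timeout\(ms\)=(\d+)\) ran for (\d+) milliseconds",
    re.DOTALL,
)


def parse_watchdog_timeout(log_text: str) -> Optional[Dict[str, object]]:
    """Extract reporting rank, sequence number, op type and timings, or None."""
    match = WATCHDOG_RE.search(log_text)
    if match is None:
        return None
    return {
        "reporting_rank": int(match.group(1)),
        "seq_num": int(match.group(2)),
        "op_type": match.group(3),
        "timeout_ms": int(match.group(4)),
        "elapsed_ms": int(match.group(5)),
    }
```

The key is deliberately named reporting_rank: as noted above, it identifies who noticed the stall, not who caused it.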

Recovery patterns

Checkpoint frequently enough that a restart is cheap

For a 7B fine-tune on 8× RTX 5090 with local NVMe checkpoints: ~14 GB written at ~3 GB/s ≈ 5 s per checkpoint, so a checkpoint every 30 minutes is about 0.3% overhead. The average loss on a failure is 15 minutes and the worst case is 30. For a 70B FSDP sharded checkpoint on the same box: ~140 GB sharded across 8 GPUs in parallel, similar overhead. Cheap, do it.

A full Llama-70B pretrain checkpoint over network storage runs ~520 GB and can take 20+ minutes; that is where Meta-class shops introduce asynchronous tiered checkpointing (fast local write, slow drain to durable storage). At Kentino size: checkpoint every 15–30 minutes to local NVMe, sync to NAS at run end. Anything more elaborate is over-engineering.
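
The overhead arithmetic is easy to redo for other checkpoint sizes and intervals. A small sketch, assuming the write is synchronous (training pauses while it runs) and that failures are equally likely at any point in the interval; the function name is hypothetical.

```python
from typing import Dict


def checkpoint_overhead(size_gb: float, write_gb_per_s: float, interval_min: float) -> Dict[str, float]:
    """Estimate write time, overhead and lost work for periodic checkpoints."""
    write_seconds = size_gb / write_gb_per_s
    interval_seconds = interval_min * 60
    return {
        "write_seconds": write_seconds,
        # Fraction of wall-clock time spent writing instead of training.
        "overhead_fraction": write_seconds / interval_seconds,
        # A failure lands on average halfway through the interval,
        # and at worst just before the next checkpoint.
        "mean_loss_min": interval_min / 2,
        "worst_loss_min": interval_min,
    }


# The 7B example from this section: ~14 GB at ~3 GB/s, every 30 minutes.
estimate = checkpoint_overhead(size_gb=14, write_gb_per_s=3, interval_min=30)
```

For the 7B case this gives a write time a little under 5 seconds and an overhead below 0.3%, consistent with the figures above.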

Automatic restart with torchelastic

torchrun supports elastic training out of the box:

torchrun \
    --nnodes=1:1 \
    --nproc-per-node=8 \
    --max-restarts=3 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:29500 \
    train.py --checkpoint-dir /mnt/nvme/ckpt

With TORCH_NCCL_ASYNC_ERROR_HANDLING=1, the chain is: NCCL collective hangs → PyTorch watchdog raises → process aborts with SIGABRT → torchelastic kills sibling ranks and restarts the workers. torchelastic does not restore state on its own; train.py has to reload latest_checkpoint.pt at startup. Total sequence is usually 60–120 seconds. Transient blips ride through. A permanent fault (GPU went XID 79) burns the --max-restarts budget and then a human is in the loop.
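
To make sure the environment variables and the torchrun flags always travel together, the launch can be wrapped in a small script. This is a sketch under the assumptions of this article (single node, 8 GPUs, the paths and port shown above); the function names are hypothetical, and train.py is assumed to accept --checkpoint-dir as in the command above.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def build_launch(script: str, ckpt_dir: str, nproc: int = 8,
                 max_restarts: int = 3) -> Tuple[List[str], Dict[str, str]]:
    """Return the torchrun argv and environment used in this article."""
    env = dict(os.environ)
    env["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"   # abort instead of hanging
    env["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "20000"  # keep a trace for post-mortem
    argv = [
        "torchrun",
        "--nnodes=1:1",
        f"--nproc-per-node={nproc}",
        f"--max-restarts={max_restarts}",
        "--rdzv-backend=c10d",
        "--rdzv-endpoint=localhost:29500",
        script,
        "--checkpoint-dir", ckpt_dir,
    ]
    return argv, env


def launch(script: str, ckpt_dir: str) -> int:
    """Run the job and return torchrun's exit code (non-zero once restarts are exhausted)."""
    argv, env = build_launch(script, ckpt_dir)
    return subprocess.run(argv, env=env).returncode
```

A non-zero return code from launch is the natural place to hook a notification, since by then the restart budget is spent and a person needs to look at the box.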

The operational practice

The biggest lever on failure cost is not the recovery code — it is what you do before the job starts.

Pre-flight checks

Before launching anything that will run more than a couple of hours, run the validation sequence:

# 1. Per-GPU stress test — catches silent thermal/ECC issues
gpu-burn 600                        # 10 minutes, all GPUs loaded concurrently

# 2. DCGM diagnostic — finds latent hardware faults
sudo dcgmi diag -r 3                # level 3 = thorough, ~15 min

# 3. NCCL fabric test — validates inter-GPU bandwidth on the box
mpirun -np 8 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

# 4. PyTorch dry-run — one step, full batch size, all ranks
torchrun --nproc-per-node=8 dryrun.py

Check	Catches
`gpu-burn`	Thermal throttling, silent ECC errors that show up only under load
`dcgmi diag`	PCIe link width regressions, power issues, memory errors, NVLink
`nccl-tests`	Bad riser, slow PCIe lane, broken NVLink bridge, misconfigured switch
dry-run	OOM, code bugs, dataloader hangs, wrong tokenizer

On a clean 8× RTX 5090 server, all_reduce_perf should land in the 50–80 GB/s bus bandwidth range depending on PCIe topology. Significantly below that means a riser or topology problem — fix it before training. We run gpu-burn for a full hour as part of pre-shipment QA on every build that leaves Kentino.

Watch for slow leaks

Real codebases accumulate memory: a logging hook holds a tensor reference, an exception path forgets to clear a buffer, an LR scheduler leaks a closure. The result is OOM at step 5000 of a 10000-step run. Cheapest mitigation: print torch.cuda.memory_allocated() every N steps to the training log. If it is growing, it should not be.
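
Eyeballing the log works, but the trend can also be checked automatically. The sketch below fits a least-squares line through (step, bytes) samples and flags sustained growth; in a training loop the samples would come from torch.cuda.memory_allocated(). The function names and the 1 KiB per step threshold are assumptions chosen for illustration.

```python
from typing import Sequence, Tuple


def growth_per_step(samples: Sequence[Tuple[int, int]]) -> float:
    """Least-squares slope, in bytes per step, through (step, bytes_allocated) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_step = sum(step for step, _ in samples) / n
    mean_bytes = sum(nbytes for _, nbytes in samples) / n
    numerator = sum((step - mean_step) * (nbytes - mean_bytes) for step, nbytes in samples)
    denominator = sum((step - mean_step) ** 2 for step, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same step")
    return numerator / denominator


def is_leaking(samples: Sequence[Tuple[int, int]], threshold_bytes_per_step: float = 1024.0) -> bool:
    """True if allocated memory grows faster than the threshold on average."""
    return growth_per_step(samples) > threshold_bytes_per_step
```

Sampling should start after the first few hundred steps, since allocations normally climb during warm-up before settling.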

The statistical reality at Kentino scale

This is where being honest about size matters. Failure rates scale with GPU-hours; large clusters fail constantly because they have many GPUs running for many hours.

Configuration	GPU-hours/month	Expected hardware events/month
Single workstation, 4 GPU	2,880	~0.05 — i.e. one every 1–2 years
Single server, 8 GPU	5,760	~0.1 — once a year if heavy use
Small cluster, 32 GPU	23,040	~0.5 — once every 2 months
Small cluster, 32 GPU 24/7	Full duty cycle	~1/month
Hyperscale, 16,384 GPU	~12M	232/month (Meta Llama 3)

Estimates are calibrated against the Meta data (roughly one failure per 50,000 GPU-hours observed), with GPU-hours counted at 720 hours per month. Real numbers vary by GPU model: consumer 4090s and 5090s with risers fail more often than datacenter L40s in their native chassis, mainly from PCIe and power-connector wear.
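
The estimates can be reproduced from the single calibration figure. A minimal sketch, assuming a flat rate of one failure per 50,000 GPU-hours and 720 hours per month; the duty_cycle parameter is an addition for illustration and is not part of the table.

```python
GPU_HOURS_PER_FAILURE = 50_000   # calibration figure quoted in this article
HOURS_PER_MONTH = 720            # 30 days x 24 hours


def expected_events_per_month(num_gpus: int, duty_cycle: float = 1.0) -> float:
    """Expected hardware events per month at a flat per-GPU-hour failure rate."""
    gpu_hours = num_gpus * HOURS_PER_MONTH * duty_cycle
    return gpu_hours / GPU_HOURS_PER_FAILURE


for gpus in (4, 8, 32, 16_384):
    print(f"{gpus:>6} GPUs: {expected_events_per_month(gpus):.2f} events/month")
```

The flat rate slightly overshoots the 232 interruptions per month Meta observed at 16,384 GPUs because 50,000 is a rounded figure; at workstation scale the rounding is irrelevant next to the variation between builds.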

Takeaway: at single-node 8-GPU scale, expect one hardware event per year of heavy use, not per week. Most fine-tunes complete without seeing any. Customers who do see them are running training continuously, and the dominant failure mode is the riser/PSU side, not the GPU silicon.

Cost of a restart

Restart cost is lost training time plus diagnostic engineer time. On an 8× RTX 5090 build at, say, €300/day amortized, an extra 30 minutes is €6. The engineer time to diagnose, re-seat a cable, and re-launch is one to three hours. The compute loss is a rounding error; the labor is the cost. Which is why the right investment is monitoring and pre-flight, not exotic recovery infrastructure.

Most failure-handling content online is written for the hyperscale tier. At Kentino size, most of it is over-engineering. The 80/20 is: monitor kernel logs, checkpoint to local NVMe, set TORCH_NCCL_ASYNC_ERROR_HANDLING=1, run gpu-burn before big jobs.

What to do next

If you are operating a multi-GPU box at Kentino scale, the concrete actions:

Set up DCGM exporter + a minimal Prometheus alert on ECC errors and thermal violations. Half a day; catches 80% of slow-creeping hardware issues.
Add TORCH_NCCL_ASYNC_ERROR_HANDLING=1 and --max-restarts=3 to your launch script today. Zero effort, prevents the overnight hang.
Pick a checkpoint interval by job length. Under 2 hours: don't bother. 2–24 hours: every 30 minutes to local NVMe. Multi-day: every 15 minutes with periodic sync to durable storage.
Run gpu-burn and dcgmi diag -r 3 after delivery, before the first real workload. Catches DOA cards and shipping damage. Re-run quarterly.
Read the riser and PSU articles before you blame the GPU. On consumer-GPU servers, the riser and PSU rail are the failure source twice as often as the GPU die.
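
The checkpoint-interval rule from the list above is simple enough to encode in a launch script. A sketch; the function name is hypothetical and the boundaries at exactly 2 and 24 hours are one reading of the ranges given.

```python
from typing import Optional


def checkpoint_interval_minutes(job_hours: float) -> Optional[int]:
    """Checkpoint interval for a job of the given expected length, or None to skip."""
    if job_hours < 2:
        return None      # short job: a rerun is cheaper than checkpointing
    if job_hours <= 24:
        return 30        # local NVMe, every 30 minutes
    return 15            # multi-day: every 15 minutes plus periodic durable sync
```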

Companion articles: K02 (distributed training and checkpoint formats), K04 (cluster storage), N06 (latency dissection).

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
