Latency Dissection: Where Every Microsecond Goes in an AI Cluster Network

People sizing AI cluster networks usually start with bandwidth (100, 200, 400 GbE) and are then surprised when their allreduce benchmark prints a number nowhere near line rate. The reason is almost always latency: in the small-message regime, bandwidth charts are useless.

This article takes a single round trip apart and accounts for every step. The numbers below are the consensus range from public measurements, NVIDIA/Mellanox documentation, and our own benches. Treat them as a budget, not a guarantee — they move with NIC silicon, switch ASIC, kernel version, BIOS settings, and how patient you are with tuning.

Audience: people specifying or troubleshooting AI cluster fabrics. Not a tuning cookbook, but the mental model that makes the cookbook readable.

The full round trip, one slide

A 64-byte ping-pong between two GPUs on different nodes, over a 100 GbE switched fabric with RDMA, breaks down roughly like this:

Component	Typical contribution	Notes
Application post / serialization	0.1–1 µs	memcpy, header build, descriptor write
NIC TX path (RDMA verbs)	0.3–0.8 µs	sub-µs on modern ConnectX / BlueField
Wire delay	~5 ns/m	50 ns over a 10 m DAC, negligible <100 m
Switch hop (cut-through, modern)	250–600 ns	InfiniBand <100 ns; Ethernet 400–600 ns
Switch hop (store-and-forward, 64 B)	1–3 µs	Add per-byte serialization on top
Per added hop	+switch latency	Leaf-spine adds 2 hops vs single switch
NIC RX path	0.3–0.8 µs	Symmetric with TX on RDMA
Kernel TCP/IP stack (one way)	10–30 µs	Sockets path; with interrupts, copies
Kernel bypass (DPDK / RDMA verbs, one way)	<1 µs	User-space polling, zero copy
Application receive / deserialization	0.1–1 µs	Wake the consumer, decode header

A single-switch RDMA RTT on 100 GbE / NDR InfiniBand lands at ~2 µs for small messages. A two-hop leaf-spine adds ~1 µs. The same path over kernel TCP/IP lands at 20–60 µs, an order of magnitude more, with worse tail behavior. Those two numbers drive almost every architectural decision below.
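The table can be added up directly. The sketch below is only bookkeeping over the ranges quoted above, assuming a 10 m cable and a single cut-through switch; the dictionary and function names are illustrative. The low end of the sum sits near the ~2 µs figure, while the high end is simply every component at its worst, not a measurement.

```python
# One-way RDMA budget for a single-switch path, in microseconds (low, high).
# Values are copied from the table above; the 10 m cable is an assumption.
ONE_WAY_US = {
    "app post": (0.1, 1.0),
    "NIC TX": (0.3, 0.8),
    "wire (10 m)": (0.05, 0.05),
    "switch hop (cut-through)": (0.25, 0.6),
    "NIC RX": (0.3, 0.8),
    "app receive": (0.1, 1.0),
}


def rtt_range_us(one_way, extra_hops=0, hop_us=(0.25, 0.6)):
    """Return (low, high) round-trip time in microseconds.

    extra_hops is the number of additional switch hops per direction,
    e.g. 2 when going from a single switch to leaf-spine.
    """
    low = sum(lo for lo, _ in one_way.values()) + extra_hops * hop_us[0]
    high = sum(hi for _, hi in one_way.values()) + extra_hops * hop_us[1]
    return 2 * low, 2 * high


if __name__ == "__main__":
    single = rtt_range_us(ONE_WAY_US)
    leaf_spine = rtt_range_us(ONE_WAY_US, extra_hops=2)
    print(f"single switch RTT: {single[0]:.2f} to {single[1]:.2f} us")
    print(f"leaf-spine RTT:    {leaf_spine[0]:.2f} to {leaf_spine[1]:.2f} us")
```

Keeping the budget as data rather than prose makes it easy to swap in your own measured ranges per component.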

[Figure: App TX (memcpy, descriptor; 0.1–1 µs) → NIC TX → Switch (cut-through; 250–600 ns) → NIC RX → App RX (decode, wake consumer; 0.1–1 µs)]

RDMA one-way path: the switch hop (250–600 ns on Ethernet, <100 ns on IB) is the dominant variable; wire delay at <100 m is negligible.

The wire is not the problem

Light in fiber travels at roughly two-thirds of c, so propagation is ~5 ns/m. A 30 m row-of-racks fiber adds 150 ns one-way; even a 500 m campus run is only 2.5 µs. Serialization is also small at modern speeds: a 64-byte frame at 100 Gb/s is 5.12 ns; a 9000-byte jumbo frame is 720 ns. For small control messages, serialization is negligible; for bulk transfer it stops mattering once the pipe is wide enough.
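Both quantities are one-line formulas. A minimal sketch, assuming the ~5 ns/m figure from the text; the function names are illustrative:

```python
PROPAGATION_NS_PER_M = 5.0  # ~two-thirds of c, the figure used in the text


def propagation_ns(length_m):
    """One-way propagation delay over a cable of the given length."""
    return PROPAGATION_NS_PER_M * length_m


def serialization_ns(frame_bytes, link_gbps):
    """Time to clock a frame onto the wire; bits / (Gbit/s) is nanoseconds."""
    return frame_bytes * 8 / link_gbps


if __name__ == "__main__":
    for meters in (10, 30, 100):
        print(f"{meters:>4} m cable: {propagation_ns(meters):.0f} ns")
    for size in (64, 1500, 9000):
        print(f"{size:>5} B at 100 Gb/s: {serialization_ns(size, 100):.2f} ns")
```

Running the same function at 25 or 400 Gb/s shows how quickly serialization shrinks as the link gets faster, while propagation stays fixed.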

Latency actually lives in software, in switch ASICs, and in the protocol you choose.

Switch fabric: cut-through vs store-and-forward

A cut-through switch starts forwarding as soon as it has read the destination header. A store-and-forward switch buffers the entire frame, optionally validates the CRC, then forwards.

Switch class	Per-hop latency
InfiniBand NDR/HDR switch (cut-through)	<100 ns
Ethernet cut-through (Tomahawk-class, AI fab)	400–600 ns
Ethernet cut-through (older / generic)	600–900 ns
Ethernet store-and-forward, 64 B	1–3 µs
Ethernet store-and-forward, 1500 B	1.5–4 µs

The InfiniBand advantage is real: HPC silicon has been optimizing this for two decades. Modern Ethernet AI switches (Tomahawk 5, Jericho3-AI, Spectrum-X) close most of the latency gap and exceed InfiniBand on buffer depth, which matters more for allreduce incast.

In a leaf-spine, a cross-rack flow traverses leaf → spine → leaf. Designs that minimize hop count (shallow fat-tree, rail-optimized topologies) are not trying to save bandwidth; they are saving 500 ns per avoided hop on every collective.
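A rough way to compare topologies is to multiply hop count by per-hop latency and, for store-and-forward, add one frame serialization per hop. The sketch below assumes every switch on the path has the same latency and ignores queueing; the 500 ns and 2000 ns per-hop values are picked from the ranges in the table, not measured.

```python
def path_latency_ns(switch_hops, per_hop_ns, frame_bytes=64, link_gbps=100,
                    store_and_forward=False):
    """One-way switching latency across `switch_hops` switches.

    A store-and-forward switch must receive the whole frame before it
    forwards, so each hop also pays one frame serialization time.
    """
    latency = switch_hops * per_hop_ns
    if store_and_forward:
        latency += switch_hops * frame_bytes * 8 / link_gbps
    return latency


if __name__ == "__main__":
    single = path_latency_ns(switch_hops=1, per_hop_ns=500)
    leaf_spine = path_latency_ns(switch_hops=3, per_hop_ns=500)
    print(f"single switch: {single:.0f} ns, leaf-spine: {leaf_spine:.0f} ns")
    print(f"cost of the two extra hops: {leaf_spine - single:.0f} ns one way")
    saf = path_latency_ns(1, 2000, frame_bytes=1500, store_and_forward=True)
    print(f"store-and-forward, 1500 B, one hop: {saf:.0f} ns")
```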

The kernel network stack

Plain TCP/IP through the Linux kernel pays per packet for: NIC interrupt dispatch (or NAPI polling), skb allocation and softirq, TCP/IP processing, a socket-buffer copy between kernel and user space, and a context switch into the application.

For a small packet on modern x86, this is 10–30 µs one way in the best case, with tail spikes above 100 µs under load. It is also CPU-bound: a single core saturates at ~1 Mpps, long before the NIC does. That is the cost the rest of the article is trying to eliminate.
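To put the ~1 Mpps per-core figure in context, it helps to compute the packet rate the link itself can carry. The sketch below uses standard Ethernet framing overhead (preamble, start-of-frame delimiter, inter-frame gap); the per-core rate is the article's round number, and the function names are illustrative.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 B preamble + 1 B SFD + 12 B inter-frame gap


def line_rate_pps(link_gbps, frame_bytes):
    """Maximum frames per second the link can carry at this frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire


def cores_needed(link_gbps, frame_bytes, per_core_pps=1e6):
    """Cores required to keep up with line rate at a fixed per-core packet rate."""
    return line_rate_pps(link_gbps, frame_bytes) / per_core_pps


if __name__ == "__main__":
    for size in (64, 1518):
        mpps = line_rate_pps(100, size) / 1e6
        print(f"{size:>5} B frames at 100 GbE: {mpps:.1f} Mpps, "
              f"about {cores_needed(100, size):.0f} cores at 1 Mpps each")
```

Even with full-size frames the kernel path needs several cores to fill a 100 GbE link, which is the gap kernel bypass is meant to close.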

Kernel bypass: DPDK, RDMA, XDP, AF_XDP

Four mainstream ways to get the kernel out of the data path, differing in how completely they bypass and what they leave on the table:

Path	Latency floor	CPU model	Compatibility	Typical use
Kernel TCP/IP	10–30 µs/way	Interrupts	Universal	Anything not latency-critical
AF_XDP	6–10 µs/way	Hybrid	Linux tools still work	Middle ground; eBPF programs
DPDK	1–3 µs/way	Busy poll	Kernel sees nothing	Telco, HFT, NFV, custom packet pipelines
RDMA verbs	<1 µs/way	Queue pair / CQE	Needs RDMA NIC + fabric	HPC, AI training, storage networking

A few practical notes:

DPDK and RDMA are not interchangeable. DPDK runs arbitrary packet processing in user space at line rate; RDMA implements a specific memory-semantic protocol with hardware offload. AI workloads want RDMA — NCCL and storage stacks speak it natively. DPDK shows up in inference proxies, telemetry, and custom dataplanes.
AF_XDP is the middle ground. Hooks into the driver at the eBPF layer, gets sub-10 µs latency, and unlike DPDK does not steal the NIC from the OS. Right answer for mixed-use boxes.
XDP without AF_XDP is for inline drop/redirect/forwarding (DDoS scrubbing, load balancers). Processes packets inside the kernel at the driver hook; does not move them to user space.

For an AI cluster: RDMA over InfiniBand or RoCEv2 for GPU-to-GPU, plain TCP/IP for everything else (management, telemetry, model download). Do not over-engineer the slow paths.

Interrupt coalescing, GRO/LRO: the throughput trap

Linux and most NIC drivers ship tuned for throughput, not latency. Two knobs matter:

Interrupt coalescing. NIC waits a configured µs window (or packet count) before raising an interrupt. Reduces per-packet overhead; adds latency equal to the window. rx-usecs 50 adds up to 50 µs on a quiet link.
GRO / LRO. Kernel (GRO) or NIC (LRO) aggregates incoming TCP segments into one larger skb before pushing up. Cheaper to process, but the aggregator deliberately waits for more packets, adding microseconds.

The trade-off is honest and non-negotiable: you cannot have both peak throughput and minimum latency on the same config. For a GPU node, the RDMA NIC handling allreduce wants rx-usecs 0 or 1, adaptive coalescing off, and GRO off on the RDMA-side interface. A separate management NIC keeps defaults; it is not on the critical path.
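As a sketch of what those settings look like in practice, the helper below builds the corresponding ethtool invocations and prints them by default instead of running them. The interface name is a placeholder, and not every driver accepts every coalescing parameter, so treat the list as a starting point to adapt rather than a recipe.

```python
import subprocess


def latency_tuning_commands(iface):
    """ethtool invocations matching the latency-oriented settings above."""
    return [
        ["ethtool", "-C", iface, "adaptive-rx", "off", "adaptive-tx", "off"],
        ["ethtool", "-C", iface, "rx-usecs", "0"],
        ["ethtool", "-K", iface, "gro", "off"],
    ]


def apply_latency_tuning(iface, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for cmd in latency_tuning_commands(iface):
        if dry_run:
            print(" ".join(cmd))
            continue
        result = subprocess.run(cmd, capture_output=True, text=True)
        # ethtool may exit non-zero when a value is already set or unsupported,
        # so report the outcome instead of aborting on the first failure.
        print(f"{' '.join(cmd)} -> exit {result.returncode} {result.stderr.strip()}")


if __name__ == "__main__":
    apply_latency_tuning("eth_rdma0")  # hypothetical interface name
```

Checking the result afterwards with `ethtool -c` and `ethtool -k` on the same interface confirms what the driver actually accepted.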

The corollary: a NIC dual-purposed for RDMA collectives and TCP downloads will deliver mediocre numbers for both unless you split traffic across queues or interfaces. Serious AI nodes have one or two dedicated RDMA NICs (ConnectX-7/8, BlueField-3) plus a separate management NIC.

The application layer is not free either

Two costs at the top of the stack rarely make it into the architecture diagrams:

memcpy. A 4 MB memcpy takes ~200 µs on a single core at ~20 GB/s. If your collective copies tensor data into a staging buffer before posting, you have burned more time than the entire network round trip. GPUDirect RDMA skips this — the NIC reads directly from GPU memory.
Serialization. Protobuf, JSON, or hand-rolled framing adds tens of µs for non-trivial payloads. Fine in a control-plane RPC; fatal in the allreduce inner loop. NCCL avoids it with fixed binary descriptors and pre-registered memory.

If your "RDMA-tuned" stack is still slow, profile user-space before blaming the fabric. We have seen 30 µs collectives where 25 µs was a PyTorch tensor reshape and only 5 µs was wire and switch.
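The memcpy figure is easy to check on your own host. The sketch below gives the back-of-envelope estimate and a crude wall-clock measurement of a host-memory copy; it copies a Python bytearray, so it only approximates what a staging-buffer copy costs, and the function names are illustrative.

```python
import time


def copy_time_us(num_bytes, bandwidth_gb_s):
    """Estimated copy time in microseconds: bytes / (GB/s)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e6


def measure_copy_us(num_bytes, repeats=20):
    """Best-of-N wall-clock time for one host-memory copy of num_bytes."""
    src = bytearray(num_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # forces a full copy of the buffer
        best = min(best, time.perf_counter() - start)
    return best * 1e6


if __name__ == "__main__":
    size = 4_000_000
    print(f"estimate at 20 GB/s: {copy_time_us(size, 20):.0f} us")
    print(f"measured on this host: {measure_copy_us(size):.0f} us")
```

Comparing either number with a ~2 µs RDMA round trip makes the case for avoiding the staging copy.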

Collective ops: latency vs bandwidth, and why packet size matters

NCCL (and any collective library) picks an algorithm based on message size:

Small messages are latency-bound. NCCL uses tree algorithms and its LL protocol (4-byte data with 4-byte flags via 8-byte atomic stores, no memory fences). BusBw is a fraction of line rate; you are measuring per-message overhead × number of messages.
Large messages are bandwidth-bound. NCCL switches to ring algorithms approaching the line rate of the slowest link. Per-message latency disappears under the megabytes.

The transition lives roughly between 64 KB and 1 MB depending on topology and NIC. Below it, switch and NIC latency and the protocol stack dominate. Above, you are reading wire speed off the chart.
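The shape of that transition falls out of the usual latency-plus-bandwidth cost model, T(n) = α + n/B. The sketch below is that generic model, not NCCL's internal tuner; the α = 2 µs and B = 50 GB/s values are illustrative assumptions.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_gb_s):
    """T(n) = alpha + n / B, with 1 GB/s equal to 1000 bytes per microsecond."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1e3)


def effective_bandwidth_gb_s(size_bytes, latency_us, bandwidth_gb_s):
    """Achieved bandwidth once the fixed per-message latency is included."""
    return size_bytes / transfer_time_us(size_bytes, latency_us, bandwidth_gb_s) / 1e3


def half_bandwidth_size_bytes(latency_us, bandwidth_gb_s):
    """Message size at which half of peak bandwidth is reached (alpha * B)."""
    return latency_us * bandwidth_gb_s * 1e3


if __name__ == "__main__":
    alpha_us, peak_gb_s = 2.0, 50.0  # assumed values for illustration
    print(f"half-bandwidth size: {half_bandwidth_size_bytes(alpha_us, peak_gb_s):.0f} B")
    for size in (64 * 1024, 1024 * 1024, 64 * 1024 * 1024):
        bw = effective_bandwidth_gb_s(size, alpha_us, peak_gb_s)
        print(f"{size:>10} B -> {bw:5.1f} GB/s")
```

Halving α moves the crossover to half the message size, which is why per-hop and per-message latency matter even on a fast link.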

Allreduce busbw — H100 / NDR InfiniBand single-rack clusters

Message size	Algorithm	BusBw (single node, 8× NVLink)	BusBw (8 nodes, NDR IB)
1 KB	Tree / LL	~5 GB/s	~2 GB/s
64 KB	Tree	~80 GB/s	~20 GB/s
1 MB	Ring	~250 GB/s	~40 GB/s
64 MB	Ring	~370 GB/s	~45 GB/s
1 GB	Ring	~450 GB/s	~48 GB/s

Published H100 NVLink ceiling is 450 GB/s; getting there at smaller messages or across nodes is a tuning fight. For Kentino's lineup (5090 / 4090 / RTX Pro 6000 Blackwell over 100 GbE RoCE), expect ~10–12 GB/s per-node busbw for large messages, with the same algorithm shape.
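For reference, the busbw column in nccl-tests is derived from the measured time rather than read off the hardware. For allreduce, the tool scales algorithm bandwidth (size divided by time) by 2(n−1)/n, where n is the number of ranks. A minimal sketch of that conversion, with an illustrative function name and made-up timing:

```python
def allreduce_bus_bandwidth_gb_s(size_bytes, time_us, ranks):
    """Convert a measured allreduce time into nccl-tests style busbw."""
    algbw = size_bytes / time_us / 1e3  # bytes per microsecond -> GB/s
    return algbw * 2 * (ranks - 1) / ranks


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB allreduce across 8 ranks in 40 ms.
    busbw = allreduce_bus_bandwidth_gb_s(1_000_000_000, 40_000, 8)
    print(f"busbw: {busbw:.2f} GB/s")
```

The correction factor is what makes busbw comparable to link speed regardless of how many ranks take part.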

Jitter is worse than latency

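If allreduce latency on every node is 30 µs ± 1 µs, the barrier waits 30 µs. If it is 30 µs on average with one node at 300 µs once in a thousand iterations, every other GPU sits idle for 270 µs each time that tail fires. One such stall per thousand iterations is small on its own (about 27 ms per GPU across a 100k-iteration epoch), but the loss multiplies with the number of collectives per iteration and with the number of nodes that can each produce the outlier, so on real jobs it adds up to a visible share of compute.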

That is why P99 / P999 latency matters for AI training more than mean latency. A barrier collective is a slowest-wins operation: the cluster moves at the speed of its worst node.
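The effect of scale can be sketched with a little probability. Assume, purely for illustration, that every node independently hits the 300 µs outlier once in a thousand collectives. The barrier is slow whenever any node is slow, so the chance of a slow step grows with node count:

```python
def p_any_straggler(nodes, p_tail):
    """Probability that at least one of `nodes` independent nodes is slow."""
    return 1 - (1 - p_tail) ** nodes


def expected_barrier_us(nodes, base_us, tail_us, p_tail):
    """Mean barrier time when the slowest node sets the pace."""
    p = p_any_straggler(nodes, p_tail)
    return base_us + p * (tail_us - base_us)


if __name__ == "__main__":
    for nodes in (1, 16, 128, 1024):
        p = p_any_straggler(nodes, 0.001)
        mean = expected_barrier_us(nodes, 30.0, 300.0, 0.001)
        print(f"{nodes:>5} nodes: P(slow step) = {p:.3f}, mean barrier = {mean:.1f} us")
```

Under these assumptions a per-node P999 event becomes a common event at cluster scale, which is the argument for tracking tail percentiles per node.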

Sources of jitter:

Incast buffer congestion. AllReduce is many-to-one every iteration. Shallow-buffered switches drop packets, PFC pause kicks in, latency spikes 10×–100×. Deep-buffered AI switches (Jericho3-AI, Spectrum-X) absorb the burst.
Host CPU jitter. A cron job, an unbounded kernel thread, or a CPU frequency excursion produces a 1 ms outlier. Isolate cores for the NIC ISR/poll thread, disable C-states on critical CPUs, pin processes.
Adaptive coalescing. The driver "helps" by raising coalescing under load; latency spikes silently. Turn it off explicitly on the RDMA NIC.
Topology asymmetry. One leaf at 200 GbE, another at 100 GbE. The slower one is your floor for every collective.

Tail latency, not mean latency, is the right SLO for an AI fabric.

Measuring it right

ping measures kernel-stack ICMP RTT on the management plane. It is useless for characterizing an RDMA fabric. The tools that work:

sockperf ping-pong — sub-nanosecond-resolution UDP/TCP latency. Good for kernel-stack baselines and regressions.
ib_send_lat, ib_write_lat (perftest suite) — RDMA verb latency directly. The number you actually care about for InfiniBand and RoCE. Expect ~1–2 µs on a same-switch link.
OSU micro-benchmarks — osu_latency, osu_bw, osu_allreduce at the MPI level. The right tool for HPC apps before NCCL is in the picture.
nccl-tests — all_reduce_perf, all_gather_perf, reduce_scatter_perf. The only benchmark that matters for AI training. It exercises the exact code path your run uses. Sweep 8 B to 1 GB; the curve tells you where the fabric is broken.

Sanity check: if nccl-tests all_reduce_perf busbw at large messages is well under 80% of NIC line rate, you have a fabric problem, not a software problem. Walk the layers from the top of this article down.
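The 80% rule is simple enough to automate. A minimal sketch, assuming busbw is reported in GB/s and the link speed in Gbit/s; the function names and sample values are illustrative:

```python
def line_rate_gb_s(link_gbit_s):
    """Convert link speed from Gbit/s to GB/s."""
    return link_gbit_s / 8


def busbw_fraction(busbw_gb_s, link_gbit_s):
    """Measured large-message busbw as a fraction of NIC line rate."""
    return busbw_gb_s / line_rate_gb_s(link_gbit_s)


def looks_like_fabric_problem(busbw_gb_s, link_gbit_s, threshold=0.8):
    """True when large-message busbw falls below the threshold fraction."""
    return busbw_fraction(busbw_gb_s, link_gbit_s) < threshold


if __name__ == "__main__":
    for measured in (11.0, 7.0):  # hypothetical results on 100 GbE
        frac = busbw_fraction(measured, 100)
        verdict = "check the fabric" if looks_like_fabric_problem(measured, 100) else "ok"
        print(f"{measured:.1f} GB/s = {frac:.0%} of line rate: {verdict}")
```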

Useful habit: store nccl-tests output as part of commissioning, re-run after every firmware, driver, or topology change. A regression caught the day it happens is a one-line diff. Caught three weeks later, it is a forensic exercise.
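One way to make that habit cheap is to keep each run as a mapping from message size to busbw and compare it against the commissioning baseline. The sketch below assumes you already have those mappings (how you extract them from the tool output is left out), and the 5% tolerance and sample numbers are arbitrary choices for illustration.

```python
def busbw_regressions(baseline, current, tolerance=0.05):
    """Return (size, old, new, drop) for sizes where busbw fell by more than tolerance."""
    regressions = []
    for size in sorted(baseline):
        old = baseline[size]
        if size not in current or old <= 0:
            continue  # nothing to compare against
        new = current[size]
        drop = (old - new) / old
        if drop > tolerance:
            regressions.append((size, old, new, drop))
    return regressions


if __name__ == "__main__":
    # Hypothetical curves: message size in bytes -> busbw in GB/s.
    commissioning = {65536: 4.0, 1048576: 9.5, 1073741824: 11.5}
    after_upgrade = {65536: 3.9, 1048576: 7.0, 1073741824: 11.4}
    for size, old, new, drop in busbw_regressions(commissioning, after_upgrade):
        print(f"{size} B: {old} -> {new} GB/s ({drop:.0%} drop)")
```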

When latency truly matters — and when it doesn't

Latency matters:

Synchronous gradient allreduce in data-parallel training. Every iteration, every node. Tail latency = compute loss.
Fine-grained pipeline parallelism with small micro-batches. Bubble time at stage boundaries is floored by inter-node latency.
Tensor-parallel layer splits that fan out and back within a forward/backward pass — a chain of small collectives, pure latency game.
Parameter-server step communication at high update rate.

Latency mostly doesn't:

Inference batching at request granularity. Per-request latency is dominated by GPU compute (TTFT, per-token decode). 10 µs of network is noise next to 50 ms of decoding.
Bulk model load from object storage at job start. Throughput-bound.
Checkpoint write to networked storage. Big sequential writes, tune for bandwidth.
DataLoader shuffle through worker prefetch. Batched, throughput-bound.

The honest implication: a single-node 4×- or 8×-GPU server (Kentino's bread-and-butter K-AI lineup) does its inter-GPU communication over NVLink or PCIe, not the network. The network is for storage, telemetry, and the occasional multi-node experiment. Optimizing the fabric for collective latency only pays off with real multi-node training, which most Kentino customers do not run. For pure inference and single-node training, 25 GbE with a clean kernel stack is enough.

What to do next

If you are commissioning a new fabric, run this sequence:

ib_send_lat between every pair of nodes. Anything more than 1.5× the median is a flag — bad cable, dirty optic, misrouted topology.
nccl-tests all_reduce_perf across 8 B → 1 GB. Save the curve. Compare to the published reference for your NIC and switch class.
Compare TCP sockperf ping-pong against RDMA ib_send_lat. Gap should be 10–30×. If it is 2×, your kernel stack is unusually fast (unlikely) or your RDMA path is broken (likely — wrong PFC config, wrong DCQCN setup, RDMA falling back to soft path).
Re-run nccl-tests under load. Push storage traffic concurrently and watch busbw degrade. Healthy clusters degrade <20% under realistic mixed load; sick clusters collapse.
Tune one variable at a time. Coalescing, adaptive routing, PFC thresholds, ECN markings. Document every change. Change three things and one helps — you don't know which.
Alert on P99 collective latency, not mean. Mean hides the problem; P99 is what training actually feels.
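Three of the checks above reduce to a few lines of arithmetic once the measurements are collected: the 1.5× median flag, the TCP-to-RDMA gap, and a percentile for alerting. A minimal sketch with hypothetical node names and sample values:

```python
import math
import statistics


def flag_slow_pairs(latency_us_by_pair, factor=1.5):
    """Return node pairs whose latency exceeds factor x the median."""
    median = statistics.median(latency_us_by_pair.values())
    return {pair: lat for pair, lat in latency_us_by_pair.items()
            if lat > factor * median}


def tcp_rdma_gap(tcp_us, rdma_us):
    """Ratio of kernel-stack latency to RDMA latency."""
    return tcp_us / rdma_us


def percentile(samples, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    pairs = {("n1", "n2"): 1.6, ("n1", "n3"): 1.7, ("n2", "n3"): 3.1}
    print("slow pairs:", flag_slow_pairs(pairs))
    print(f"TCP/RDMA gap: {tcp_rdma_gap(40.0, 2.0):.0f}x")
    samples = [30.0] * 980 + [300.0] * 20  # 2% of collectives hit the tail
    print(f"mean {statistics.mean(samples):.1f} us, "
          f"median {percentile(samples, 50):.0f} us, "
          f"P99 {percentile(samples, 99):.0f} us")
```

In the sample data the mean moves by a few microseconds while P99 jumps tenfold, which is why the alert belongs on the percentile.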

The follow-ups — N07 on routing and congestion control (ECMP, DCQCN, ECN), N08 on RDMA setup in practice, and K02 on distributed training algorithm choice — go deeper into specific decisions this article only sketches.

Latency in an AI fabric is not the wire, it is everything wrapped around the wire — and the fix is almost always to shorten the software path before you spend more on hardware.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
