Routing Complexity in AI Networks: ECMP, Adaptive Routing, DCQCN, and Why HPC People Obsess About It
Previous articles walked through cables, NICs, switches, topologies. This one is about what happens above: how packets find a path through a multi-switch fabric, and what stops the fabric from collapsing when ten thousand GPUs decide to all-reduce at the same instant.
AI traffic looks fundamentally different from normal datacenter traffic. A web frontend handles millions of small TCP connections. An AI training run sends a few enormous, perfectly synchronized flows to known peers, and everyone waits for the slowest one. The techniques that work for the first case fall apart for the second.
Warning up front: most Kentino customers run 1–4 node clusters. At that scale, none of these problems actually bite. Wire up four nodes with a single 100 GbE switch (or no switch — see N05), run vanilla RoCE with defaults, and never think about ECMP or DCQCN. We write this anyway because (a) it is useful background for sizing decisions, and (b) the day you go from four nodes to sixteen, all of it suddenly matters.
What ECMP is, and why it was good enough until AI
ECMP — Equal-Cost Multi-Path — is the routing trick that makes leaf-spine and fat-tree fabrics work. When there are several equal-cost paths from leaf A to leaf B (through spines S1..S4), the switch hashes packet fields and uses the hash to pick a spine. The classic 5-tuple hash is (src IP, dst IP, src port, dst port, protocol).
For traditional cloud workloads — millions of short-lived TCP connections — ECMP works beautifully. The law of large numbers does the load-balancing for you. With 10,000 flows and 8 spines you get within a few percent of perfect balance.
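As a rough illustration of the hashing step, the sketch below maps 5-tuples to spines and counts flows per spine. It is not how a switch ASIC hashes: real hardware uses CRC- or XOR-style functions with vendor-specific seeds, and SHA-256 is used here only as a convenient, well-mixed stand-in. The function names and the synthetic flow list are invented for the example.

```python
import hashlib
from collections import Counter


def pick_spine(flow, num_spines):
    """Map a 5-tuple to a spine index by hashing its fields."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_spines


def spine_load(flows, num_spines):
    """Count how many flows land on each spine."""
    counts = Counter(pick_spine(flow, num_spines) for flow in flows)
    return [counts.get(spine, 0) for spine in range(num_spines)]


# 10,000 synthetic short flows: (src IP, dst IP, src port, dst port, protocol)
flows = [
    (f"10.0.{i % 250}.{i % 200 + 1}", "10.1.0.1", 1024 + i, 443, "tcp")
    for i in range(10_000)
]
load = spine_load(flows, 8)
ideal = len(flows) / 8
worst = max(abs(n - ideal) / ideal for n in load)
print(load, f"worst deviation from ideal: {worst:.1%}")
```

With thousands of distinct tuples the per-spine counts sit close to the ideal share; rerunning the same function on a handful of flows shows how quickly that balance disappears.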
Traditional cloud traffic:
- 10,000+ short-lived TCP flows
- Law of large numbers balances spines
- Near-perfect load distribution
- ECMP hash collisions: rare
AI training traffic:
- 8 elephant flows, all simultaneous
- 5-tuple hash maps many flows to the same spine
- Hot spine = half-speed allreduce
- Collision probability: ~99.8% with 8 flows on 8 spines
ECMP was designed for many small flows. AI training sends a few huge synchronized flows — statistical balance breaks down.
The elephant-flow problem
AI training is the opposite of web traffic. A training step: every GPU computes a gradient, every GPU sends its gradient to all the others (allreduce), everyone waits for the slowest transfer, repeat.
Allreduce is a small number of very large flows. Ring-allreduce on 8 nodes with 200 Gbps NICs is 8 simultaneous flows, each multiple gigabytes, all at line rate, all starting at the same instant. Meta's RoCE-at-scale paper documented it clearly: in AI clusters, a tiny number of "elephant" flows account for nearly all the bytes, and they all want the network at once.
ECMP's 5-tuple hash hates this. With N elephant flows hashed independently across S paths (N ≤ S), the probability that no two flows share a path is S! / ((S-N)! · S^N). For 8 flows on 8 spines that is about 0.24%, so a collision is all but guaranteed. Even with 32 spines for 8 flows it reaches only ~39%. Adding spines does not remove the coin-flip; it gives you more empty links while two flows fight over one.
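The zero-collision formula above is easy to evaluate directly. The sketch below assumes each flow is hashed independently and uniformly onto one of the paths, which is an idealization of real ECMP hashing; the function name is made up for the example.

```python
from math import perm


def zero_collision_probability(flows: int, paths: int) -> float:
    """Probability that `flows` independent uniform hashes land on distinct paths.

    Implements S! / ((S - N)! * S**N) with N = flows and S = paths.
    """
    if flows > paths:
        return 0.0  # pigeonhole: at least two flows must share a path
    return perm(paths, flows) / paths ** flows


for spines in (8, 16, 32, 64):
    p = zero_collision_probability(8, spines)
    print(f"8 flows on {spines:>2} spines: P(no collision) = {p:.2%}")
```

Running it for a range of spine counts makes the point numerically: the probability improves slowly, and only by leaving most links idle.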
A collision oversubscribes the spine link 2:1, the affected flows run at half speed, and because the step waits for the slowest flow, the iteration runs at half speed. Production-cluster measurements report ECMP collisions causing up to 40% performance loss on allreduce.
Workarounds: enhanced hashing, packet-level load balancing, adaptive routing
Four real responses, in increasing order of disruption to deploy:
1. Better hashing (E-ECMP, QP-aware). Cheapest fix. Standard 5-tuple hashing collapses RDMA traffic onto one tuple per QP pair — a single allreduce flow really is one ECMP bucket. Hash on the RoCE destination QP number too, and have the application spread traffic across many QPs ("QP scaling"). Meta's numbers: up to 40% better allreduce. Still statistical — collisions are rarer, not eliminated.
2. Flowlet switching. Detect an idle gap, re-hash from there. Works for TCP, poorly for back-to-back RoCE.
3. Packet-level load balancing / packet spraying. Hash per-packet, accept reordering, let the NIC reassemble. Eliminates the elephant-flow problem but requires NIC and switch cooperation. This is what NVIDIA Spectrum-X does — per-packet spraying with hardware reordering at the SuperNIC.
4. Adaptive routing. Switch tracks per-port congestion and picks the least-loaded equal-cost path at switching time instead of hashing blindly. Combined with packet spraying this is what InfiniBand has had for fifteen years. Bringing it to Ethernet is the headline feature of NVIDIA Spectrum-X (Spectrum-4/5 ASICs + BlueField or ConnectX-8 SuperNICs); Cisco Silicon One G200/P200; and Broadcom Tomahawk 5 / Jericho 3-AI with cell-based scheduled fabric.
For 1–8 node clusters, adaptive routing is overkill. For 64+ GPU jobs scaling to 1,000 GPUs, it is the difference between "the network is the bottleneck" and "the GPUs are the bottleneck." That is why every serious AI Ethernet vendor has built or licensed adaptive-routing silicon in 2024–2026.
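To make the contrast between blind hashing and load-aware path selection concrete, here is a toy comparison. It is a deliberately simplified model: placement is per flow rather than per packet, the "adaptive" chooser has perfect knowledge of every path's load, and all flows are the same size. Real adaptive routing works on local, per-port congestion signals. The function names are hypothetical.

```python
import random


def static_hash_placement(num_flows, num_paths, rng):
    """Each flow lands on a pseudo-random path, as a blind hash would place it."""
    load = [0] * num_paths
    for _ in range(num_flows):
        load[rng.randrange(num_paths)] += 1
    return load


def adaptive_placement(num_flows, num_paths):
    """Each flow goes to whichever path currently carries the fewest flows."""
    load = [0] * num_paths
    for _ in range(num_flows):
        load[load.index(min(load))] += 1
    return load


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
collided = sum(max(static_hash_placement(8, 8, rng)) > 1 for _ in range(trials))
print(f"static hash: {collided / trials:.1%} of trials put two flows on one path")
print("adaptive placement of 8 flows on 8 paths:", adaptive_placement(8, 8))
```

The load-aware chooser spreads 8 flows one per path every time, while the random placement almost always doubles up somewhere.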
Congestion control: why "just don't drop packets" is not free
The other half of routing complexity is what happens when too much traffic arrives at one switch port at once. Two choices: a lossy network drops packets and the endpoints back off (the open Internet); a lossless network pushes back on the upstream switch to pause — no drops, but congestion propagates backwards.
RoCE has historically required a lossless fabric. RDMA NICs handle packet loss badly: with go-back-N recovery, a single dropped packet forces the sender to retransmit everything from the lost packet onward, which on 100 GbE can mean re-sending megabytes. RoCEv2 with go-back-N is essentially unusable on a lossy network at high utilization.
PFC (Priority Flow Control, IEEE 802.1Qbb) is what makes Ethernet behave losslessly. When a switch's per-priority ingress buffer exceeds a threshold, it sends a PAUSE frame upstream asking the sender to stop transmitting that priority. Eight priorities, eight independent stop/go signals.
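The pause/resume behaviour can be pictured as a small state machine with two thresholds. The sketch below is a conceptual model of one priority on one port, not switch firmware; the class name and the string return values are invented for the example, and the 400 KB / 300 KB thresholds are borrowed from the XOFF/XON values in this article's switch config.

```python
class PfcPriorityQueue:
    """Per-priority buffer with XOFF/XON hysteresis (thresholds in bytes)."""

    def __init__(self, xoff_bytes, xon_bytes):
        if xon_bytes >= xoff_bytes:
            raise ValueError("XON must be below XOFF")
        self.xoff_bytes = xoff_bytes
        self.xon_bytes = xon_bytes
        self.upstream_paused = False

    def update(self, depth_bytes):
        """Take the current buffer depth; return 'PAUSE', 'RESUME', or None."""
        if not self.upstream_paused and depth_bytes >= self.xoff_bytes:
            self.upstream_paused = True
            return "PAUSE"
        if self.upstream_paused and depth_bytes <= self.xon_bytes:
            self.upstream_paused = False
            return "RESUME"
        return None


queue = PfcPriorityQueue(xoff_bytes=400_000, xon_bytes=300_000)
depths = [100_000, 350_000, 420_000, 380_000, 310_000, 290_000, 150_000]
events = [queue.update(depth) for depth in depths]
print(events)
```

The gap between the two thresholds is what stops the port from flapping between pause and resume while the buffer hovers near the limit.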
Head-of-line blocking and victim flows
PFC pauses are coarse. A pause says "stop sending priority 3 on this link" — it does not know which flow caused congestion. If ten flows share priority 3 and one is congested downstream, PFC pauses all ten. The other nine "victim flows" get punished for someone else's problem. This is head-of-line blocking, the central pain of lossless Ethernet at scale.
Head-of-line blocking: A→X causes PFC on the trunk. B and C talking to Y (which is fine) are also paused — victim flows.
PFC deadlock: if topology and traffic form a cyclic dependency of paused links — A waits for B, B for C, C for A — the entire fabric locks. Observed in production. Modern switches have deadlock detection; modern Clos topologies are deadlock-free by construction; but every serious RoCE deployment has a recovery plan anyway.
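A cyclic pause dependency is just a cycle in a "waits on" graph, so checking a proposed topology and routing scheme for one is a standard depth-first search. The sketch below assumes you can already express "a buffer on X can be blocked by a pause from Y" as a dictionary; how to derive that graph from a real fabric is outside its scope, and the function name is hypothetical.

```python
def find_pause_cycle(waits_on):
    """Return a list of nodes forming a cycle in the waits-on graph, or None.

    `waits_on` maps each node to the nodes whose pause frames can block it.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in waits_on}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in waits_on.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that closes a cycle
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
print(find_pause_cycle({"leaf1": ["spine1"], "spine1": ["leaf2"], "leaf2": []}))
```

The first call reports the A, B, C loop from the paragraph above; the second, a simple up-then-down path, has no cycle.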
The point of modern AI congestion control is to keep buffers small enough that PFC never fires in steady state. PFC stays armed as a safety net for genuine microbursts. The mechanism that keeps it from firing is DCQCN.
DCQCN — the standard congestion control for RoCEv2
DCQCN (Data Center Quantized Congestion Notification) is the algorithm that lossless RoCE has converged on. It was developed by Microsoft and Mellanox (SIGCOMM 2015). As of 2025–2026 it is what NVIDIA's ConnectX/BlueField NICs run by default, and what Azure reports running in production, where RDMA carries around 70% of traffic and is supported in all public regions.
DCQCN has three roles:
- CP (Congestion Point): the switch. It marks packets with ECN when egress queue depth crosses a threshold; the marking probability ramps from 0% at Kmin to Pmax at Kmax.
- NP (Notification Point): the receiver NIC. On ECN-marked packets it sends a CNP (Congestion Notification Packet) back to the sender, rate-limited to typically one per 50 µs per flow.
- RP (Reaction Point): the sender NIC. It multiplicatively decreases the QP's rate on each CNP, and increases it additively (then hyper-additively) while no CNPs arrive.
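The sender-side reaction can be sketched in a few lines. This is a heavily simplified model: the cut-on-CNP rule and the alpha update follow the form given in the DCQCN paper, but the staged fast-recovery and hyper-increase phases are collapsed into a single "quiet period" step, and the constants (g = 1/256, a 0.04 Gbps additive step) are illustrative assumptions, not NIC defaults. The class and method names are hypothetical.

```python
class ReactionPoint:
    """Simplified DCQCN sender-side rate control for one QP (rates in Gbps)."""

    def __init__(self, line_rate_gbps, g=1 / 256, additive_step_gbps=0.04):
        self.line_rate = line_rate_gbps
        self.current_rate = line_rate_gbps  # DCQCN starts at line rate
        self.target_rate = line_rate_gbps
        self.alpha = 1.0                    # congestion estimate in [0, 1]
        self.g = g
        self.additive_step = additive_step_gbps

    def on_cnp(self):
        """Multiplicative decrease when a CNP arrives."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP for one timer period: decay alpha, move halfway to the target."""
        self.alpha *= 1 - self.g
        self.target_rate = min(self.line_rate, self.target_rate + self.additive_step)
        self.current_rate = (self.current_rate + self.target_rate) / 2


rp = ReactionPoint(line_rate_gbps=100)
rp.on_cnp()
print(f"after one CNP: {rp.current_rate:.1f} Gbps")
for _ in range(5):
    rp.on_quiet_period()
print(f"after five quiet periods: {rp.current_rate:.1f} Gbps")
```

The shape is what matters: one CNP cuts the rate sharply, and recovery back toward line rate takes several CNP-free periods.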
Typical switch config for DCQCN on 100 GbE:
```
# NVIDIA Cumulus-style ECN profile on the lossless RoCE priority (priority 3)
interface swp1..swp32
  qos remark dscp-to-tc 26 to 3    # DSCP 26 → TC 3
  qos congestion-mark ecn priority 3
  qos ecn-kmin 5KB                 # start marking
  qos ecn-kmax 200KB               # mark at Pmax
  qos ecn-pmax 1%
  qos pfc priority 3
  qos pfc xoff 400KB               # PFC pause (well above Kmax)
  qos pfc xon 300KB
```
Kmin/Kmax/Pmax is the most-tuned triple in RoCE. Kmin small (a few packets) so ECN kicks in before buffer exhaustion; Kmax much larger so marking ramps gradually; PFC pause threshold well above Kmax so PFC only fires if DCQCN was too slow. Microsoft's original deployment: Kmin = 5 KB, Kmax = 200 KB, Pmax = 1%.
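The effect of the triple is easiest to see as a function from queue depth to marking probability. The sketch below uses the 5 KB / 200 KB / 1% values quoted above as defaults and assumes a RED-style profile in which every packet is marked once the queue exceeds Kmax; it also treats 1 KB as 1,000 bytes. Both are assumptions, so check how your switch defines them. The function name is made up for the example.

```python
def ecn_mark_probability(queue_bytes, kmin_bytes=5_000, kmax_bytes=200_000, pmax=0.01):
    """Probability that a packet is ECN-marked at the given egress queue depth."""
    if queue_bytes <= kmin_bytes:
        return 0.0  # below Kmin: never mark
    if queue_bytes > kmax_bytes:
        return 1.0  # above Kmax: mark everything (assumed behaviour)
    # between Kmin and Kmax: linear ramp from 0 up to Pmax
    return pmax * (queue_bytes - kmin_bytes) / (kmax_bytes - kmin_bytes)


for depth in (4_000, 50_000, 100_000, 200_000, 250_000):
    print(f"{depth:>7} B queued -> mark probability {ecn_mark_probability(depth):.4f}")
```

Plotting or printing this curve for a candidate triple is a quick sanity check before pushing new thresholds to a switch.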
DCQCN's known weaknesses: incast collapse (100 senders hit one receiver, queue fills faster than CNPs propagate back, PFC fires — DCQCN+ addresses this); long-tail unfairness (late-starting flows get throttled longer); parameter sensitivity (Kmin tuned for 4 nodes does not generalize to 64). Honest summary: DCQCN is the default because it works most of the time. HPCC, Swift, EQDS, revived TIMELY exist as proposed replacements but none have displaced it as of 2026.
ECN, DCTCP, and where pure TCP fits
Clean separation because these get conflated:
- ECN — IP-layer mechanism (RFC 3168) where switches mark instead of drop. A signal, nothing reacts on its own.
- DCQCN — the RDMA endpoint reaction. Reads ECN via CNPs, adjusts rate.
- DCTCP — the TCP endpoint reaction. Reads ECN in ACKs, scales the window by the marked-packet fraction. Microsoft + Stanford, SIGCOMM 2010, shipped in Windows Server and Linux.
| Traffic | Protocol | Congestion control |
|---|---|---|
| GPU-to-GPU gradient (allreduce, NCCL over RoCE) | RoCEv2 | DCQCN (ECN + CNP + rate) |
| Storage (NFS/RDMA, BeeGFS over RDMA) | RoCEv2 | DCQCN |
| Storage (NFS over TCP, S3 to object store) | TCP | DCTCP (or BBR, or CUBIC) |
| Kubernetes / orchestration / control plane | TCP | CUBIC default, DCTCP if tuned |
| Telemetry / Prometheus / SSH | TCP | CUBIC |
If you run a pure RoCE fabric, you are running DCQCN, so know your defaults. If you also have TCP storage or control traffic on the same wire, DCTCP is worth enabling (net.ipv4.tcp_congestion_control = dctcp and net.ipv4.tcp_ecn = 1 on the hosts, plus ECN-marking switches).
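For comparison with DCQCN's rate cut, the DCTCP reaction listed above ("scales the window by the marked-packet fraction") can be sketched as a per-window update. This shows only the decrease rule and the running estimate of the marked fraction; normal TCP window growth is omitted, and g = 1/16 is the commonly cited gain, used here as an assumption. The function name is hypothetical.

```python
def dctcp_update(cwnd, alpha, marked_acks, total_acks, g=1 / 16):
    """One per-window DCTCP update; returns (new_cwnd, new_alpha).

    alpha is a moving average of the fraction of ECN-marked packets.
    """
    fraction = marked_acks / total_acks if total_acks else 0.0
    alpha = (1 - g) * alpha + g * fraction
    if marked_acks:
        # cut in proportion to congestion, never below one segment
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return cwnd, alpha


cwnd, alpha = 100.0, 0.0
for marked in (0, 2, 10, 10, 0):
    cwnd, alpha = dctcp_update(cwnd, alpha, marked_acks=marked, total_acks=10)
    print(f"marked {marked:>2}/10 -> cwnd {cwnd:6.2f}, alpha {alpha:.3f}")
```

A lightly marked window trims cwnd by a few percent, while sustained marking approaches the classic halving.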
Why AI traffic stresses all of this more than traditional datacenter
Three properties classic congestion control did not design for:
- Synchronized bursts. All GPUs start allreduce at the same nanosecond. Zero to 100% utilization in microseconds. No warm-up for slow-start to find the rate. DCQCN starts at line rate for exactly this reason.
- Few, large flows. Law of large numbers does not save you. 8 flows → common ECMP collisions; 8 receivers → severe PFC HoL blocking.
- The critical path is the slowest link. Allreduce step time = max flow completion time. No "average case." A 1% probability of a bad collision compounds across months of training.
This is why HPC and AI people obsess about networking in a way web-scale operators historically have not. The web is elastic — slow requests take a bit longer for some users. AI training is not — the whole job runs at the speed of its slowest synchronized component.
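The "slowest link" argument can be put into numbers with three small helpers. They assume each training step is independent and that a bad step costs a fixed multiple of a good one, which is a simplification; the function names and the example figures are illustrative, not measurements.

```python
def step_time(flow_times):
    """An allreduce step finishes only when its slowest flow finishes."""
    return max(flow_times)


def probability_of_any_bad_step(p_bad, steps):
    """Chance that at least one of `steps` independent steps hits a bad collision."""
    return 1 - (1 - p_bad) ** steps


def expected_slowdown(p_bad, penalty):
    """Mean step-time multiplier if a bad step takes `penalty` times as long."""
    return 1 + p_bad * (penalty - 1)


print(step_time([1.0, 1.0, 2.0, 1.0]))            # one slow flow sets the pace
print(probability_of_any_bad_step(0.01, 1_000))   # near-certain within 1,000 steps
print(expected_slowdown(0.01, 2.0))               # 1% of steps at half speed
```

Even a rare collision is effectively guaranteed to occur somewhere in a long run, and its cost is paid by every GPU in the job.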
The switchless complexity (preview of N05)
N05 covers switchless topologies — direct-connect, mesh, ring, 2D/3D torus, tesseract — for small clusters where buying a 100 GbE switch is overkill. When you remove the switch, every node becomes a router: multiple NICs, multiple paths, something has to pick the next hop.
The standard answer in 2026 is FRR (FRRouting) on each node, configured for BGP unnumbered. BGP unnumbered uses IPv6 link-local addresses for peering, so you don't assign IPv4 addresses on every link. Each node announces its loopback, learns peers' loopbacks, the Linux routing table picks the right next hop.
```
! Minimal FRR config for a node in a switchless mesh
router bgp 65001
 bgp router-id 10.0.0.1
 neighbor enp1s0 interface remote-as external
 neighbor enp2s0 interface remote-as external
 neighbor enp3s0 interface remote-as external
 address-family ipv4 unicast
  ! this node's loopback
  network 10.0.0.1/32
  redistribute connected
 exit-address-family
```
Twelve lines per node, ECMP across however many NIC pairs the node has — with the same caveats as switch ECMP. For a 4-node mesh, ECMP collisions show up on the host side too. Mitigate with more QPs or accept the loss (small at 4 nodes). "No switch" does not mean "no routing complexity" — it relocates it to the hosts.
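Because every node needs its own AS number, router ID, and loopback, the per-node config is usually templated rather than typed by hand. The sketch below renders the same statements as the FRR snippet above; the numbering scheme (ASN = 65000 + node ID, loopback 10.0.0.<node ID>) and the function name are assumptions for illustration, not a Kentino or FRR convention.

```python
def render_frr_config(node_id, interfaces, base_asn=65000):
    """Render a BGP-unnumbered stanza for node `node_id` (1-based)."""
    if not 1 <= node_id <= 254:
        raise ValueError("node_id must fit in the last octet of 10.0.0.x")
    loopback = f"10.0.0.{node_id}"
    lines = [
        f"router bgp {base_asn + node_id}",
        f" bgp router-id {loopback}",
    ]
    # one unnumbered eBGP session per directly connected NIC
    lines += [f" neighbor {ifname} interface remote-as external" for ifname in interfaces]
    lines += [
        " address-family ipv4 unicast",
        f"  network {loopback}/32",
        "  redistribute connected",
        " exit-address-family",
    ]
    return "\n".join(lines)


for node in (1, 2):
    print(render_frr_config(node, ["enp1s0", "enp2s0", "enp3s0"]))
    print("!")
```

Generating all nodes from one function keeps ASNs and loopbacks unique, which is the easiest thing to get wrong when copying a config between hosts.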
When this matters for you
Most readers will not need to do any of this.
- 1–4 nodes, single switch, RoCE for storage and occasional multi-GPU training: leave the switch on defaults. Stock DCQCN and ECN thresholds are fine. PFC is enabled, you will probably never see it pause. ECMP collisions are real but the impact on a 4-node run is single-digit percent.
- 4–16 nodes, dedicated training cluster: enable DCQCN explicitly, set Kmin/Kmax/Pmax to 5 KB / 200 KB / 1% (Azure baseline), turn on QP scaling in NCCL, monitor PFC pause counters and CNP rates. If PFC pauses are increasing, DCQCN is not aggressive enough — lower Kmin.
- 16+ nodes or 1,000+ GPUs: adaptive routing pays for itself. Buy Spectrum-X with matching SuperNICs, or InfiniBand and stop worrying about ECMP, or partner with someone who has run a fabric at this scale. The cost of getting it wrong is months of training time.
- Switchless cluster, any size: FRR + BGP unnumbered. 1–2 days of config and testing per node-count doubling. The big gotcha is forgetting kernel-side ECMP (net.ipv4.fib_multipath_hash_policy = 1 for L4 hashing).
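The kernel-side ECMP setting mentioned in the last item is easy to verify from a script. The sketch below reads the sysctl through its /proc path on a Linux host; the function name and the choice to return None when the file is unreadable are assumptions made for the example.

```python
from pathlib import Path

HASH_POLICY_PATH = Path("/proc/sys/net/ipv4/fib_multipath_hash_policy")


def l4_multipath_hashing_enabled(path=HASH_POLICY_PATH):
    """Return True if the policy is 1 (L4 hashing), False if not, None if unreadable."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        return None  # not Linux, or the sysctl is not exposed


state = l4_multipath_hashing_enabled()
if state is None:
    print("could not read fib_multipath_hash_policy on this host")
elif state:
    print("kernel ECMP hashes on L4 fields")
else:
    print("kernel ECMP is not using L4 hashing; flows between two hosts share one path")
```

Dropping a check like this into node bring-up catches the missing sysctl before the first benchmark run rather than after.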
The instinct "I'll buy more bandwidth so congestion never happens" is wrong. Synchronized AI traffic creates instantaneous demand at line rate on every path, no matter how many spines. Bandwidth solves rate problems, not coordination problems.
What to do next
If any of this might bite your cluster:
- Inventory your traffic. RoCE? TCP? Which links carry both? You cannot tune DCQCN without knowing which flows benefit.
- Read your switch's actual DCQCN/ECN/PFC defaults. Vendor "AI optimized" profiles vary wildly. Some ship with PFC disabled. Some use Kmin = port-speed × 5 µs — sensible heuristic, not always right.
- Turn on counters. PFC pause RX/TX, CNP RX/TX, ECN marks, ECMP per-link utilization. Without these, you cannot know whether your fabric is healthy under load.
- Measure with a real workload. nccl-tests/all_reduce_perf tells you more in five minutes than a week of synthetic iperf.
- Decide if you have an ECMP-collision problem before spending money on adaptive routing. Most 1–16 node clusters do not; most 64+ node clusters do.
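When reading vendor defaults, it helps to turn the "Kmin = port-speed × 5 µs" heuristic mentioned above into bytes so it can be compared with a fixed value such as 5 KB. The sketch below does only that unit conversion; the function name is hypothetical and the 5 µs default simply mirrors the heuristic as quoted.

```python
def kmin_bytes_for_port(port_speed_gbps, delay_us=5.0):
    """Bytes that arrive on a port in `delay_us` microseconds at line rate.

    1 Gbps for 1 microsecond is 1,000 bits, so bits = Gbps * us * 1000.
    """
    return round(port_speed_gbps * delay_us * 1000 / 8)


for speed in (25, 100, 200, 400):
    print(f"{speed:>3} GbE: Kmin ≈ {kmin_bytes_for_port(speed):,} bytes")
```

On a 100 GbE port the heuristic lands more than ten times above a 5 KB Kmin, which is one concrete way two "sensible" profiles can behave very differently.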
N08 covers actual RDMA setup — GID indexes, MTU, NCCL env vars, per-NIC mlx5_core tuning that turns a working RoCE link into a fast one. That is where this theory becomes commands you type.
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.