RDMA Setup in Practice + Cluster Uplink Design

The previous articles in the N track argued for RDMA (N02) and walked through topology choices (N04, N05). This one is the hands-on part: install the drivers, prove the path works, turn on GPUDirect, validate NCCL, then step up one altitude and think about how the whole cluster connects to the world.

We assume Ubuntu 22.04 or 24.04, Mellanox/NVIDIA ConnectX-5 or ConnectX-6 NICs (ConnectX-7 for NDR), and either InfiniBand HDR/NDR or RoCEv2 over a lossless Ethernet fabric. The commands are the ones we actually type on Kentino test benches before a 4-node K-AI cluster ships.

Drivers: MLNX_OFED or upstream rdma-core?

Path | What you get | When to pick it
MLNX_OFED (now NVIDIA DOCA-OFED) | NVIDIA-tested driver bundle, GPUDirect peermem, perftest, mlxconfig | Production AI clusters with ConnectX-6/7 and GPUDirect
Upstream rdma-core + in-tree mlx5 | What Ubuntu ships, no extra repo | Lab boxes, single-node, no GPUDirect, no firmware tooling

For anything carrying NCCL traffic in production, install MLNX_OFED. The upstream mlx5 works, but you lose mlxconfig, the bundled perftest, and — most importantly — a kernel-side nvidia-peermem tested against the same OFED tree.

A clean install on Ubuntu 22.04 with kernel 5.15.x:

tar xf MLNX_OFED_LINUX-*.tgz && cd MLNX_OFED_LINUX-*
sudo ./mlnxofedinstall --add-kernel-support --with-nvmf --force
sudo /etc/init.d/openibd restart && sudo systemctl enable openibd

The --add-kernel-support flag matters. Skip it on a kernel outside the OFED matrix and the DKMS build fails silently — you end up running stock mlx5 without knowing. Confirm the user-space stack with dpkg -l | grep -E 'libibverbs|rdma-core|mlnx-ofed'.

Bring the link up and confirm the NIC sees the fabric

Three commands tell you everything important:

sudo mst start && mst status   # firmware tools
ibstat                          # port state, width, speed
ibv_devinfo -v                  # GIDs, max_qp, MTU, hw revision

A healthy port shows State: Active, Physical state: LinkUp, the expected Rate: (e.g. 200), and the right Link layer: (InfiniBand or Ethernet). The two fields people miss:

  • Link layer: InfiniBand vs Ethernet. A dual-mode (VPI) ConnectX flips with mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 (1=IB, 2=Ethernet); take the exact device path from mst status. Reboot required.
  • Rate: matches expected speed. A 200 Gb/s HDR port that came up at 100 Gb/s is the most common silent failure: bad cable, wrong DAC length, or switch port forced low. Check before benchmarking.
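The four fields above are easy to check by script across many nodes. What follows is a minimal sketch, assuming ibstat prints one "Field: value" pair per line under a "Port N:" heading; the sample text and its values are made up for illustration, and parse_ibstat_ports and port_problems are hypothetical helper names, not part of any OFED tool.

```python
import re

# Illustrative text shaped like `ibstat` output; the values are made up.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4123
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Link layer: InfiniBand
"""


def parse_ibstat_ports(text):
    """Return {(ca_name, port_number): {field: value}} from ibstat-style text."""
    ports = {}
    ca = port = None
    for raw in text.splitlines():
        line = raw.strip()
        ca_match = re.match(r"CA '(.+)'$", line)
        if ca_match:
            ca, port = ca_match.group(1), None
            continue
        port_match = re.match(r"Port (\d+):$", line)
        if port_match:
            port = int(port_match.group(1))
            ports[(ca, port)] = {}
            continue
        # Only collect fields once we are inside a "Port N:" section.
        if port is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[(ca, port)][key.strip()] = value.strip()
    return ports


def port_problems(fields, expected_rate, expected_layer):
    """List every way a port deviates from the healthy state; empty means OK."""
    wanted = {
        "State": "Active",
        "Physical state": "LinkUp",
        "Rate": str(expected_rate),
        "Link layer": expected_layer,
    }
    return [
        f"{name}: got {fields.get(name)!r}, want {value!r}"
        for name, value in wanted.items()
        if fields.get(name) != value
    ]


ports = parse_ibstat_ports(SAMPLE_IBSTAT)
for (ca_name, port_no), port_fields in ports.items():
    for problem in port_problems(port_fields, expected_rate=200, expected_layer="InfiniBand"):
        print(f"{ca_name} port {port_no}: {problem}")
```

In practice the text would come from running ibstat on each node; the sample here stands in for a port that negotiated down to 100 Gb/s.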

On RoCE, also confirm v2 is selected (v1 is ethertype 0x8915, v2 is UDP/4791 and is what every modern stack uses): sudo cma_roce_mode -d mlx5_0 -p 1 -m 2.

Subnet manager (InfiniBand only)

InfiniBand is not Ethernet: nothing on the fabric routes until a subnet manager (SM) assigns LIDs. On a lab fabric, run sudo apt install opensm && sudo systemctl enable --now opensm on one node. In production, run the embedded SM on the switch. Two software SMs racing each other is an afternoon of debugging nobody needs. RoCE has no SM; routing is the Ethernet fabric's job, which is why RoCE config is mostly about switch QoS, not about the host.

Prove RDMA actually works: perftest

Before any NCCL or framework run, prove the wire with perftest. Two nodes, server first:

# server (node A)
ib_send_bw -d mlx5_0 -F --report_gbits -D 10
# client (node B), pointing at the server's address
ib_send_bw -d mlx5_0 -F --report_gbits -D 10 10.10.1.1

Expected numbers on a clean fabric:

Link | ib_send_bw (large msg) | ib_send_lat (2-byte)
100 Gb/s EDR / 100 GbE RoCE | 95–98 Gb/s | 1.0–1.5 µs
200 Gb/s HDR / 200 GbE RoCE | 188–197 Gb/s | 0.9–1.3 µs
400 Gb/s NDR | 370–395 Gb/s | 0.8–1.1 µs

If you are 20% under those numbers, do not start chasing NCCL tuning. The fabric is wrong. Check in order: (1) MTU, (2) PFC on RoCE, (3) the cable/transceiver pair, (4) PCIe gen and lane count for the NIC, (5) NUMA placement. Sub-microsecond latency is achievable for in-rack NDR; anything over 5 µs on a single-switch RoCE fabric is broken.
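Item (4) in that checklist is plain arithmetic, so it is worth doing before opening the chassis. The sketch below computes the raw PCIe ceiling from the per-lane signalling rate and 128b/130b encoding and compares it to the NIC speed. It ignores TLP/DLLP protocol overhead, so real throughput is somewhat lower; pcie_gbps and slot_can_feed are hypothetical helper names.

```python
# Per-lane signalling rate in GT/s. Gen3 and later use 128b/130b encoding.
PCIE_GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gbps(gen, lanes):
    """Raw PCIe ceiling in Gb/s, before TLP/DLLP protocol overhead."""
    return PCIE_GT_PER_LANE[gen] * lanes * 128 / 130


def slot_can_feed(nic_gbps, gen, lanes):
    """True if the slot's raw ceiling is at least the NIC's line rate."""
    return pcie_gbps(gen, lanes) >= nic_gbps


for gen, lanes in [(3, 16), (4, 8), (4, 16), (5, 16)]:
    ceiling = pcie_gbps(gen, lanes)
    verdict = "ok" if slot_can_feed(200, gen, lanes) else "too slow"
    print(f"Gen{gen} x{lanes}: {ceiling:6.1f} Gb/s raw, 200 Gb/s NIC: {verdict}")
```

A 200 Gb/s NIC that trained at Gen3 x16 or Gen4 x8 tops out near 126 Gb/s raw, which looks exactly like a fabric problem in perftest.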

GPUDirect RDMA: turning on the DMA path

The whole point of RDMA in an AI cluster is the NIC reading and writing GPU memory directly, bypassing the host. That needs nvidia-peermem (or, on newer kernels, DMA-BUF — NVIDIA now recommends DMA-BUF where the kernel supports it, but most production stacks still ship peermem).

sudo modprobe nvidia-peermem
lsmod | grep nvidia_peermem
echo nvidia-peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf

If it fails to load, nvidia-peermem was most likely built without the peer-memory API that MLNX_OFED adds to the kernel RDMA stack. Install order matters: OFED first, then the NVIDIA driver, then nvidia-peermem, because peermem builds against the OFED headers at install time.

Prove the end-to-end path with a GPU-to-GPU RDMA write:

# server
ib_write_bw -d mlx5_0 --use_cuda=0 -F --report_gbits -D 10
# client, pointing at the server's address
ib_write_bw -d mlx5_0 --use_cuda=0 -F --report_gbits -D 10 10.10.1.1

--use_cuda=0 registers CUDA memory on GPU 0 as the RDMA buffer. If the result is within a few percent of the host-memory case, GPUDirect is working. If it is 5× slower, the path is staging through host memory — usually a peermem load problem or a PCIe topology where NIC and GPU sit on opposite NUMA nodes.

MTU and PFC for RoCE (this is where RoCE clusters live or die)

RoCEv2 over a lossless Ethernet fabric needs three things working together:

  1. A large MTU end-to-end. Set 9000 on every NIC, switch port, and router hop. RoCE picks the largest IB-style MTU that fits inside the Ethernet MTU — 9000 Ethernet gives RoCE a 4096-byte MTU, which is what you want.
  2. PFC on the RDMA priority. Link-layer pause that prevents drops on a designated traffic class. Standard practice: RDMA on priority 3, everything else on priority 0.
  3. ECN marking on switches and NICs. ECN is the long-term congestion signal; PFC is the short-term emergency brake. ECN does the work most of the time; PFC fires only when ECN cannot keep up.
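The MTU rule in item 1 can be sketched in a few lines. This is an illustration under a stated assumption: the per-packet overhead is taken as IPv4 + UDP + BTH + ICRC (44 bytes), and extended transport headers would add a little more. roce_mtu is a hypothetical helper, not a driver API.

```python
# The MTU sizes InfiniBand (and therefore RoCE) can negotiate.
IB_MTUS = (256, 512, 1024, 2048, 4096)

# Assumed RoCEv2 overhead inside the Ethernet payload:
# IPv4 header (20) + UDP header (8) + BTH (12) + ICRC (4).
ROCE_V2_OVERHEAD = 20 + 8 + 12 + 4


def roce_mtu(eth_mtu):
    """Largest IB-style MTU whose packet still fits in the Ethernet MTU."""
    fitting = [m for m in IB_MTUS if m + ROCE_V2_OVERHEAD <= eth_mtu]
    if not fitting:
        raise ValueError(f"Ethernet MTU {eth_mtu} is too small for RoCE")
    return max(fitting)


for eth in (1500, 4200, 9000):
    print(f"Ethernet MTU {eth} -> RoCE MTU {roce_mtu(eth)}")
```

The 1500-byte case is the one to watch: a single hop left at the default MTU drops the RDMA payload to 1024 bytes per packet.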

On the host side, with mlnx_qos:

sudo mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0   # PFC on prio 3
sudo mlnx_qos -i enp1s0f0 --trust dscp
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class

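The 106 written to traffic_class is the IP ToS byte: DSCP 26 shifted left by two bits (104) plus the ECN-capable bits (2). The DSCP value (26) and the PFC priority (3) need to agree at every hop; with the usual default DSCP-to-priority mapping, DSCP 26 lands on priority 3 (26 ÷ 8, rounded down). The switch fabric must mirror this: PFC enabled on priority 3, ECN marking on those queues, lossless buffer config, and per-port headroom sized for the BDP of the longest link.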
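The numbers in the three host commands are related by simple bit arithmetic, sketched here. The DSCP-to-priority rule (top three bits of the DSCP) is assumed to be the default mapping; a custom mapping on the NIC or switch changes it. All function names are hypothetical.

```python
def tos_byte(dscp, ecn=0b10):
    """Build the IP ToS byte: 6 DSCP bits followed by 2 ECN bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be 0..3")
    return (dscp << 2) | ecn


def dscp_from_tos(tos):
    """Recover the DSCP from a ToS byte."""
    return tos >> 2


def default_priority(dscp):
    """Assumed default mapping: priority is the top 3 bits of the DSCP."""
    return dscp >> 3


def pfc_mask(priority):
    """Comma-separated 8-entry mask with PFC enabled on one priority."""
    return ",".join("1" if i == priority else "0" for i in range(8))


dscp = 26
print("traffic_class value:", tos_byte(dscp))
print("priority:", default_priority(dscp))
print("--pfc argument:", pfc_mask(default_priority(dscp)))
```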

Buying a switch from a vendor with documented RoCE templates (NVIDIA Spectrum, Arista, Cisco Nexus 9000) saves a week. Rolling PFC config by hand on a generic Broadcom whitebox is doable but it is a project. We have done it. We do not recommend it.

DCQCN (Data Center Quantized Congestion Notification) is the control loop that ties PFC and ECN together: ECN marks packets when a queue fills, the receiver echoes a CNP back, the sender slows down, then ramps when the queue drains. PFC is the fallback when DCQCN cannot react fast enough. On modern ConnectX-6/7 firmware DCQCN is on by default, and at 4–16 nodes the defaults are fine. The tuning game (alpha, target rate, byte/timer thresholds) is for people running at scales where 0.5% on allreduce is worth two weeks of work.
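To make the loop concrete, here is a toy model of the sender side. It follows the update rules from the published DCQCN description (cut the rate by alpha/2 on a CNP, decay alpha and move halfway back to the target rate when no CNP arrives), but it is heavily simplified: one timer stands in for the separate alpha and rate-increase timers, and the additive and hyper-increase stages are left out. It is a sketch for intuition, not what NIC firmware runs.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    line_rate: float      # Gb/s
    g: float = 1 / 256    # gain used when updating alpha
    alpha: float = 1.0    # congestion estimate, 0..1
    current: float = 0.0  # current sending rate, Gb/s
    target: float = 0.0   # rate to recover toward, Gb/s

    def __post_init__(self):
        # Start at line rate with nothing to recover.
        self.current = self.target = self.line_rate

    def on_cnp(self):
        """Receiver reported ECN marks: remember the rate, then cut it."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """One timer period without a CNP: decay alpha, fast-recover."""
        self.alpha *= 1 - self.g
        self.current = (self.target + self.current) / 2


sender = DcqcnSender(line_rate=200.0)
sender.on_cnp()
print(f"after CNP: {sender.current:.1f} Gb/s")
for step in range(1, 6):
    sender.on_quiet_period()
    print(f"quiet period {step}: {sender.current:.2f} Gb/s")
```

The shape is the point: a sharp cut on congestion, then a geometric climb back toward the previous rate.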

NCCL: the variables that matter

NCCL is the layer that uses RDMA for PyTorch, JAX, DeepSpeed, vLLM tensor-parallel, and similar. It auto-detects, mostly correctly. Four environment variables show up in every production launch script:

Variable | What it does | When to set it
NCCL_IB_DISABLE | 1 forces TCP sockets instead of IB/RoCE | Debugging only
NCCL_SOCKET_IFNAME | Interface for NCCL bootstrap (not the data path) | Always: point to the management NIC so bootstrap doesn't race onto the RDMA fabric
NCCL_IB_HCA | Which HCA(s) NCCL uses for the data plane | Multi-NIC nodes: explicit beats auto
NCCL_NET_GDR_LEVEL | How aggressively to use GPUDirect RDMA based on PCIe topology | PIX/PHB when GPUs and NICs share a PCIe switch / NUMA node

A working launch on a 4-node, 4-GPU-per-node cluster:

export NCCL_SOCKET_IFNAME=eno1            # 1 GbE management network
export NCCL_IB_HCA=mlx5_0,mlx5_1          # both RDMA NICs
export NCCL_IB_GID_INDEX=3                # RoCE v2 GID
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_DEBUG=INFO                    # one-shot, then drop to WARN

mpirun -np 16 -N 4 --hostfile hosts -x NCCL_SOCKET_IFNAME \
    -x NCCL_IB_HCA -x NCCL_IB_GID_INDEX -x NCCL_NET_GDR_LEVEL \
    -x NCCL_DEBUG ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

The NCCL_DEBUG=INFO output is mandatory reading on the first run. It tells you which transport NCCL picked (Channel ... via NET/IB/0 GDR) per rank, per channel. See via NET/Socket anywhere and the RDMA path is not being used — you have not tested what you think you tested.
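Reading that output by eye does not scale past a few ranks. A minimal sketch of an automated check follows, assuming only what is described above: that each network channel line contains "via NET/IB" or "via NET/Socket". The sample lines are made up and abbreviated; real NCCL log lines carry more fields. transports and rdma_path_ok are hypothetical helpers.

```python
import re
from collections import Counter

# Matches the transport name that follows "via NET/" in a channel line.
TRANSPORT_RE = re.compile(r"via NET/(\w+)")


def transports(log_text):
    """Count how many channel lines used each network transport."""
    return Counter(m.group(1) for m in TRANSPORT_RE.finditer(log_text))


def rdma_path_ok(log_text):
    """True only if IB/RoCE was used and no channel fell back to sockets."""
    seen = transports(log_text)
    return seen.get("IB", 0) > 0 and seen.get("Socket", 0) == 0


# Illustrative, abbreviated log lines; not copied from a real run.
GOOD_LOG = """\
NCCL INFO Channel 00 : 0 -> 4 via NET/IB/0 GDR
NCCL INFO Channel 01 : 0 -> 4 via NET/IB/1 GDR
"""
BAD_LOG = """\
NCCL INFO Channel 00 : 0 -> 4 via NET/IB/0 GDR
NCCL INFO Channel 01 : 0 -> 4 via NET/Socket/0
"""

print("good log:", dict(transports(GOOD_LOG)), rdma_path_ok(GOOD_LOG))
print("bad log: ", dict(transports(BAD_LOG)), rdma_path_ok(BAD_LOG))
```

Failing the job when rdma_path_ok returns False on the first run is cheaper than discovering the fallback from a slow training curve.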

Validating with nccl-tests

nccl-tests validates the whole stack — driver, OFED, peermem, NCCL, network — end-to-end. The number that matters is bus bandwidth (busbw), not algorithm bandwidth (algbw). Bus bandwidth normalizes for ring/tree size and is what you compare against the NIC's wire speed.

Cluster size | Expected allreduce busbw (large msg)
1 node, 4 GPUs over NVLink | 200–400 GB/s
1 node, 8 GPUs over NVLink | 250–500 GB/s
2 nodes, 4 GPUs each, 200 GbE RDMA | 20–24 GB/s
4 nodes, 4 GPUs each, 200 GbE RDMA + GDR | 20–22 GB/s

Inter-node allreduce algbw caps at roughly half the NIC line rate, because allreduce sends and receives each byte about twice per rank; busbw folds that factor back in, so its ceiling is the line rate itself. 200 Gb/s ÷ 8 = 25 GB/s ceiling; 22 GB/s observed is healthy. If the 2-node number lands 5× below the table (4–5 GB/s instead of 20+), RDMA is not being used for the inter-node hop. Read the NCCL_DEBUG=INFO output.
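The relationship between the two columns nccl-tests prints can be written down directly. For allreduce, nccl-tests defines busbw as algbw × 2(n−1)/n, where n is the number of ranks. The sketch below applies that and compares the result to the wire; the 12 GB/s input is an illustrative figure, not a measurement.

```python
def allreduce_busbw(algbw_gbs, ranks):
    """Bus bandwidth (GB/s) from algorithm bandwidth for allreduce."""
    if ranks < 2:
        raise ValueError("busbw needs at least 2 ranks")
    return algbw_gbs * 2 * (ranks - 1) / ranks


def line_rate_gbs(link_gbps):
    """Convert a link speed in Gb/s to GB/s."""
    return link_gbps / 8


# Illustrative: 16 ranks reporting 12 GB/s algbw on a 200 Gb/s fabric.
busbw = allreduce_busbw(12.0, 16)
ceiling = line_rate_gbs(200)
print(f"busbw {busbw:.1f} GB/s of {ceiling:.1f} GB/s ({busbw / ceiling:.0%} of line rate)")
```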

Now zoom out: cluster uplink design

Everything above is about the data plane — the fabric GPUs use to talk to each other. The other half of a useful cluster is the uplink: how this fabric connects to the outside world. People build the wrong uplink all the time.

A 4-node K-AI training cluster has three external relationships:

  1. The corporate network / WAN — model registries, dataset storage (S3, NFS, MinIO), Git, container registries, telemetry.
  2. Developer workstations — engineers SSHing in, launching jobs, copying checkpoints out, running Jupyter.
  3. Other clusters — a second training cluster, an inference cluster, a CI/eval cluster.

Bandwidth math

If 4 nodes do allreduce at ~22 GB/s each, the internal fabric moves roughly 700 Gb/s of east-west aggregate. The uplink does not need to match that. It needs to match the data ingest rate:

  • 4 nodes × 4 GPUs × ~1 GB/s per GPU (image/video model) = 16 GB/s ≈ 128 Gb/s sustained read from object storage.
  • LLM pretraining on tokenized text is far smaller — 1–4 Gb/s, because tokens are dense and one batch lasts a long time.
  • A fine-tune reloading checkpoints frequently peaks at 40 Gb/s for a few seconds, then drops to near zero.

Workload | Sustained ingest | Burst | Uplink
LLM pretraining (tokenized text) | 1–4 Gb/s | 20 Gb/s | 25 GbE
Image / video model training | 50–150 Gb/s | 200 Gb/s | 2× 100 GbE LAG or 1× 200 GbE
Fine-tuning / RLHF with checkpoint shuffle | 5–20 Gb/s | 50 Gb/s | 25–100 GbE
Inference cluster behind a load balancer | 5–50 Gb/s | model-dependent | 25–100 GbE

Common mistake: spending €40k on a 400 GbE uplink because the internal fabric is 400 GbE. Wrong target. The right target is the dataset read rate. We have built 4-node clusters with a 25 GbE uplink that ran flat-out for weeks on LLM token data.
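The bandwidth math above reduces to one multiplication and a unit conversion. A small sketch, using the figures from this section; the list of uplink tiers is an assumption chosen to match the speeds mentioned here, and the helper names are hypothetical.

```python
def gbytes_to_gbits(gb_per_s):
    """Convert GB/s to Gb/s."""
    return gb_per_s * 8


def sustained_ingest_gbps(nodes, gpus_per_node, gb_per_s_per_gpu):
    """Sustained read rate the uplink must carry, in Gb/s."""
    return gbytes_to_gbits(nodes * gpus_per_node * gb_per_s_per_gpu)


def smallest_uplink_gbps(required_gbps, tiers=(25, 100, 200, 400)):
    """Smallest uplink tier (Gb/s) that covers the required rate."""
    for tier in sorted(tiers):
        if tier >= required_gbps:
            return tier
    raise ValueError(f"{required_gbps} Gb/s exceeds every tier in {tiers}")


east_west = gbytes_to_gbits(4 * 22)           # 4 nodes at ~22 GB/s each
vision = sustained_ingest_gbps(4, 4, 1.0)     # image/video model
print(f"east-west aggregate: {east_west} Gb/s")
print(f"vision ingest: {vision} Gb/s -> {smallest_uplink_gbps(vision)} GbE uplink")
print(f"LLM text ingest: 4 Gb/s -> {smallest_uplink_gbps(4)} GbE uplink")
```

Burst headroom from the table still applies on top of the sustained figure.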

Aggregated vs allocated bandwidth

A node with four 100 GbE NICs has 400 Gb/s aggregate wire capacity. That capacity is per-flow allocated, not pooled. A single TCP connection between two IPs uses one NIC — at most 100 Gb/s. ECMP and per-flow hashing distribute different flows across the four NICs.

  • Allreduce is many flows in parallel — one per channel per peer. NCCL natively spreads across multiple NICs (NCCL_IB_HCA listing both). 4× 100 GbE is functionally close to 400 Gb/s for NCCL.
  • A single dataset stream (one HTTP GET pulling a 1 TB shard) is one flow. 4× 100 GbE gives it 100 Gb/s, not 400. To use the aggregate, the dataloader must open parallel streams — which DALI, WebDataset, and MosaicML's Streaming all do by design.
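Per-flow allocation is easy to see in a toy model. The sketch below hashes a flow's 5-tuple to pick one of four links. SHA-256 is used only to get a deterministic hash; real NICs and switches use their own hash functions, and the addresses and ports are made up.

```python
import hashlib

LINK_GBPS = 100
N_LINKS = 4


def pick_link(flow, n_links=N_LINKS):
    """Map a 5-tuple to one link index; the same flow always gets the same link."""
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def links_used(flows, n_links=N_LINKS):
    """Set of link indices that a group of flows lands on."""
    return {pick_link(f, n_links) for f in flows}


# One big HTTP GET: a single 5-tuple (src, dst, proto, sport, dport).
single = [("10.0.0.1", "10.0.9.9", "tcp", 40000, 443)]
# A dataloader opening 64 parallel streams: same endpoints, different source ports.
parallel = [("10.0.0.1", "10.0.9.9", "tcp", 40000 + i, 443) for i in range(64)]

for name, flows in (("single stream", single), ("64 streams", parallel)):
    used = links_used(flows)
    print(f"{name}: {len(used)} link(s), ceiling {len(used) * LINK_GBPS} Gb/s")
```

However large the transfer, the single stream never sees more than one link's worth of bandwidth.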

Fabric separation

Data plane
  • 100/200/400 GbE RoCE or HDR/NDR InfiniBand
  • NCCL, dataset reads, checkpoint writes
  • Lossless, PFC/ECN, dedicated switches
  • Sterile — no non-AI traffic
Management plane
  • 1 GbE or 10 GbE
  • SSH, Prometheus, NTP, syslog, IPMI/BMC
  • Cheap switches, standard L2/L3, no special QoS
  • Always works, even if data fabric is down

Always run two fabrics. The management plane is not optional: when the data fabric breaks (bad transceiver, PFC misconfig, switch crash) you need an SSH path that does not depend on the broken fabric. Debugging a RoCE storm over the link that is storming is the kind of mistake you make exactly once. IPMI/BMC and NTP live here too; clock skew stays invisible until cross-node logs and traces stop lining up in the middle of an incident.

BGP unnumbered for routed Clos

Above ~8 nodes, L2 leaf-spine becomes painful — spanning tree limits, MAC table scaling, broadcast storms, no native multi-path. The modern answer is an L3 routed Clos: every leaf-spine link is unnumbered, BGP carries routes, ECMP spreads flows across spines.

BGP unnumbered peers over IPv6 link-local addresses the kernel auto-assigns, so you skip the per-link /31 bookkeeping entirely. A leaf running FRRouting on Linux looks roughly like:

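! Leaf switch: unnumbered eBGP on the two spine-facing ports (swp1, swp2),
! advertising the rack subnet. AS number and prefix are examples.
router bgp 65001
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 address-family ipv4 unicast
  network 10.1.1.0/24
  redistribute connected
 exit-address-family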

Each leaf is its own AS, eBGP advertises loopbacks, and ECMP spreads flows across spines. RFC 7938, which documents the pattern, has the spines share one AS; if every spine gets its own AS instead, ECMP across them also needs bgp bestpath as-path multipath-relax. Cumulus/NVIDIA, Arista, Cisco, Juniper all support unnumbered BGP today. At 4 nodes a single switch is fine. At 8–16 nodes with two spines, routed Clos starts paying for itself. Above 16 nodes it is the only sensible answer.
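Because unnumbered peering removes the per-link addressing, the leaf configs differ only in AS number and advertised prefix, which makes them easy to generate. A minimal sketch that emits the same statements as the stanza above (plus a closing exit-address-family); the AS numbering scheme, port names, and prefixes are illustrative assumptions, and leaf_asn and leaf_bgp_config are hypothetical helpers.

```python
def leaf_asn(index, base=65001):
    """One private AS per leaf, counted up from an assumed base."""
    return base + index


def leaf_bgp_config(asn, uplinks, prefix):
    """Render an FRR-style BGP stanza for one leaf."""
    lines = [f"router bgp {asn}"]
    # One unnumbered eBGP session per spine-facing port.
    lines += [f" neighbor {port} interface remote-as external" for port in uplinks]
    lines += [
        " address-family ipv4 unicast",
        f"  network {prefix}",
        "  redistribute connected",
        " exit-address-family",
    ]
    return "\n".join(lines) + "\n"


# Illustrative: three leaves, two spine uplinks each, one /24 per rack.
for i in range(3):
    print(leaf_bgp_config(leaf_asn(i), ["swp1", "swp2"], f"10.1.{i + 1}.0/24"))
```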

Connecting to other clusters

Do not put a training cluster and an inference cluster on the same RDMA fabric. Training is huge bursty allreduce; inference is steady-state small messages; QoS requirements diverge; a misbehaving training run will starve the inference path. Two separate fabrics meeting at an L3 router with normal IP routing is the right answer. Cross-cluster control traffic (job queues, log shipping, artifact transfer) rides the management plane or a dedicated cross-cluster Ethernet link. A 70B weight file is ~140 GB; a checkpoint is 2× that, ~280 GB. Moving that checkpoint at line rate takes ~90 seconds on 25 GbE and ~22 seconds on 100 GbE; the weights alone take half as long. Plan for the bigger end.
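The transfer times follow from size × 8 ÷ link speed. A short sketch that reproduces them and lets you plug in a more realistic link efficiency; the efficiency parameter is an assumption added here, with 1.0 meaning ideal line rate.

```python
def transfer_seconds(size_gb, link_gbps, efficiency=1.0):
    """Seconds to move size_gb gigabytes over a link of link_gbps Gb/s."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_gb * 8 / (link_gbps * efficiency)


WEIGHTS_GB = 140     # 70B weight file
CHECKPOINT_GB = 280  # roughly 2x the weights

for link in (25, 100):
    w = transfer_seconds(WEIGHTS_GB, link)
    c = transfer_seconds(CHECKPOINT_GB, link)
    print(f"{link} GbE: weights {w:.1f} s, checkpoint {c:.1f} s")
```

Calling transfer_seconds(CHECKPOINT_GB, 25, efficiency=0.8) gives a feel for how far real transfers drift from the line-rate best case.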

What to do next

A reasonable workflow for standing up RDMA on a fresh cluster:

  1. Wire it and check link layer first. Run ibstat on every node. Same rate, same MTU, same link layer. Fix the physical layer before touching software.
  2. Install MLNX_OFED, not just rdma-core, if GPUDirect or NCCL is in scope. Match the OFED build to the running kernel.
  3. Run perftest between every pair of nodes before touching frameworks. Numbers within 5% of theoretical = healthy. Anything below = stop and fix.
  4. Load nvidia-peermem, prove it with ib_write_bw --use_cuda=0. Result should match the host-memory case.
  5. Configure PFC, ECN, and DCQCN if you are on RoCE. On IB this step does not exist; that is half the reason people pick IB.
  6. Run nccl-tests allreduce on 2 nodes, then 4, then 8. Inspect the NCCL_DEBUG=INFO output on the first run of every cluster size. Confirm NET/IB ... GDR shows up.
  7. Separate fabrics. Management on cheap 1/10 GbE. Data on the RDMA fabric. Don't share.
  8. Size the uplink to your dataset ingest rate, not to the internal fabric speed. Most clusters need far less external bandwidth than they have.
  9. Above 8 nodes, plan routed Clos with BGP unnumbered. Below 8, a single switch is fine.

The follow-up articles in the N track cover the latency dissection (N06) and the routing complexity (N07) — that is where DCQCN tuning and ECMP hashing pathologies live. The K track picks up from here with distributed training (K02), inference clusters (K03), and storage (K04) — all of which assume the RDMA stack on this page already works.


This is part of the Kentino Wiki, a reference series on AI compute and the systems that connect it. Corrections welcome at info@kentino.com.