PCIe Lanes & Topology in a Multi-GPU AI Server
There is a persistent myth in the consumer AI space that "PCIe x8 vs x16 doesn't matter for inference." It is mostly correct, and the people repeating it almost never know why. They also tend to fall over the moment you ask them why a desktop board can't host a fourth GPU at full bandwidth, or why an 8-GPU EPYC server doesn't behave like two 4-GPU desktops glued together.
This article is the long version. It covers what PCIe lanes are, how they're allocated on the CPUs Kentino builds with, what bifurcation and switches do, where NVLink fits (and where it doesn't), and when topology actually matters. It ends with concrete 4-GPU and 8-GPU diagrams for the EPYC platforms we ship.
What a PCIe lane is, briefly
A PCIe lane is a pair of differential serial links — one for each direction — between a CPU root complex (or a downstream switch) and a device. Multiple lanes are bonded to form a wider link: x1, x4, x8, x16. Bandwidth scales linearly with lane count, and roughly doubles with each generation.
| Generation | Per-lane rate | x16 raw (per direction) | x16 usable, approx. (per direction) |
|---|---|---|---|
| Gen3 | 8 GT/s | 16 GB/s | ~15.75 GB/s |
| Gen4 | 16 GT/s | 32 GB/s | ~31.5 GB/s |
| Gen5 | 32 GT/s | 64 GB/s | ~63 GB/s |
| Gen6 | 64 GT/s | 128 GB/s | ~121 GB/s |
Two caveats. First, the bandwidth is per-direction: a Gen5 x16 link does 64 GB/s each way simultaneously, which is why marketing slides quote either "64 GB/s" or "128 GB/s." Second, Gen6 is not shipping in any GPU you can buy today. The spec is finalized but the silicon isn't in workstation cards as of May 2026; the first Gen6 endpoints are datacenter-only parts that Kentino doesn't build with. For our actual lineup (RTX 5090, RTX 4090, RTX Pro 6000 Blackwell in both variants, L40, L4), Gen5 x16 is the ceiling, and the Ada-generation cards (RTX 4090, L40, L4) top out at Gen4 x16.
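The "usable" column in the table can be reproduced from the per-lane transfer rate. The following is a minimal sketch, assuming the 128b/130b line encoding that Gen3 through Gen5 use. Gen6 moves to PAM4 signaling with FLIT-based encoding, so its overhead differs and it is left out here. The function name `link_bandwidth_gbs` is illustrative, not part of any library.

```python
# Per-direction PCIe bandwidth from transfer rate, lane count, and encoding.
# Assumption: Gen3-Gen5 use 128b/130b encoding. Packet-level overhead (TLP
# headers, flow control) is ignored, so real-world throughput is lower still.

GT_PER_S = {3: 8, 4: 16, 5: 32}   # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 128 / 130   # 128 payload bits per 130 bits on the wire


def link_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Return approximate usable bandwidth in GB/s, one direction."""
    if gen not in GT_PER_S:
        raise ValueError(f"unsupported generation: {gen}")
    if lanes not in (1, 2, 4, 8, 16):
        raise ValueError(f"unsupported link width: x{lanes}")
    raw_gbit = GT_PER_S[gen] * lanes            # Gbit/s on the wire
    return raw_gbit * ENCODING_EFFICIENCY / 8   # bits -> bytes


if __name__ == "__main__":
    for gen in (3, 4, 5):
        for lanes in (4, 8, 16):
            print(f"Gen{gen} x{lanes}: {link_bandwidth_gbs(gen, lanes):.2f} GB/s")
```

The same function covers narrower links, which is useful when comparing x16, x8, and x4 allocations on the same generation.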
Lane budgets per CPU socket
This is where the desktop / workstation / server boundary becomes visible. Lane counts on the CPUs in current Kentino builds, after subtracting what's reserved for chipset / DMI:
| CPU class | Generation | Total PCIe lanes | Usable for GPUs / NICs / NVMe |
|---|---|---|---|
| Intel Core (LGA1700/1851) | Gen5/4 mix | 20 | ~20 (very tight) |
| Intel Xeon W7 / W9 (Sapphire R.) | Gen5 | 112 | ~112 |
| AMD Ryzen 9000 (AM5) | Gen5 | 28 | ~24 |
| AMD Threadripper 7000 | Gen5 | 92 | ~88 |
| AMD Threadripper Pro 7000 WX | Gen5 | 128 | ~128 |
| AMD EPYC Genoa (9004) | Gen5 | 128 | ~128 (single socket) |
| AMD EPYC Turin (9005) | Gen5 | 128 | ~128 (single socket) |
| AMD EPYC dual-socket | Gen5 | 160 | shared via xGMI |
Three consequences fall out of this immediately.
A consumer desktop CPU cannot host four full-bandwidth GPUs. With 20–28 lanes total, you allocate one x16 to the primary slot, one x4 to NVMe, and you've run out. "4-GPU desktop" builds that bifurcate x16 into 4×x4 work for inference because most inference doesn't saturate x4 Gen5 (~16 GB/s). They do not work for training across cards because of gradient sync traffic.
A workstation Xeon W or Threadripper Pro hosts four GPUs at x16 comfortably — 64 lanes for GPUs, plenty left for NVMe and a 25/100 GbE NIC.
A single-socket EPYC Genoa or Turin gives you 128 lanes, which is the only sensible way to build an 8-GPU server with all eight cards at x16. Dual-socket EPYC adds nominal lanes but the gain is smaller than it looks, because cross-socket traffic flows over xGMI, which is finite and shared.
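The three consequences above come down to lane arithmetic. A small sketch, assuming the "usable" lane counts from the table and two hypothetical device lists (they are examples, not Kentino configurations):

```python
# Check whether a set of devices fits in a CPU's usable PCIe lane budget.
# Lane counts are taken from the table above; the builds are hypothetical.

USABLE_LANES = {
    "Ryzen 9000 (AM5)": 24,
    "Threadripper Pro 7000 WX": 128,
    "EPYC Genoa (single socket)": 128,
}


def lane_budget(cpu: str, devices: dict[str, int]) -> int:
    """Return lanes left over; a negative value means the build does not fit."""
    return USABLE_LANES[cpu] - sum(devices.values())


four_gpu = {"gpu0": 16, "gpu1": 16, "gpu2": 16, "gpu3": 16, "nvme": 4}
eight_gpu = {f"gpu{i}": 16 for i in range(8)}

for cpu in USABLE_LANES:
    for name, build in (("4 GPU + NVMe", four_gpu), ("8 GPU x16", eight_gpu)):
        spare = lane_budget(cpu, build)
        verdict = "fits" if spare >= 0 else "does not fit"
        print(f"{cpu:28s} {name:13s} {verdict} (spare lanes: {spare})")
```

A result of zero spare lanes is technically a fit, but it leaves no room for NVMe or a NIC, which is the practical constraint on 8-GPU builds.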
Bifurcation: cutting an x16 into smaller pieces
A PCIe slot is physically x16, but the host can be told to electrically present it as smaller links. The standard cuts are:
- x16 → 2 × x8
- x16 → 4 × x4
- x8 → 2 × x4
Bifurcation lives in the CPU and is exposed by motherboard BIOS. Whether you can actually use it depends on three things being true at once: the CPU supports it, the BIOS exposes the option, and the riser/backplane is wired to split the lanes correctly. The first two are usually fine on server-class boards (Supermicro, ASRock Rack, Gigabyte). The third is where people get burned — different vendors map lanes differently.
Bifurcation is the trick that lets you fit more GPUs in a chassis than the CPU has x16 slots for. An 8-GPU EPYC server is rarely 8 native x16 root ports; it's a mix of native and bifurcated slots routed through risers, with each GPU getting x16 or x8 depending on layout.
What you lose: bandwidth per card. A bifurcated x8 Gen5 link is 32 GB/s — half of x16. For inference this is invisible. For multi-GPU training, it shows up in gradient sync and activation passing.
PCIe switches and retimers
If bifurcation isn't enough — say you want eight GPUs all at x16 — the answer is a PCIe switch. The Broadcom PEX series is the canonical example. A PEX 89000-class switch takes one x16 upstream from the CPU and fans it out to multiple x16 downstream ports. The downstream ports oversubscribe the upstream link; if all eight GPUs hammer the host simultaneously, they share the upstream x16.
This is the architecture inside NVIDIA's HGX baseboards (and the SXM systems Kentino doesn't build). It works because in well-behaved multi-GPU workloads, most traffic is GPU-to-GPU (NVLink or PCIe peer-to-peer), not GPU-to-host. The upstream link only carries weights at load time, occasional checkpointing, and storage I/O. Inference doesn't saturate it; training mostly doesn't either, if collectives stay between GPUs.
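The oversubscription described above is easy to quantify. This sketch assumes every switch port is Gen5 x16 at a nominal 64 GB/s per direction and that host traffic is split evenly among the GPUs active at the same time. Both are simplifications, and the function names are illustrative.

```python
# Oversubscription of a PCIe switch: downstream bandwidth vs the upstream link.
# Assumptions: all ports are Gen5 x16 (nominal 64 GB/s per direction) and
# simultaneous host traffic is shared evenly between active GPUs.

PORT_GBS = 64.0  # nominal Gen5 x16, per direction


def oversubscription_ratio(downstream_ports: int, upstream_ports: int = 1) -> float:
    """Total downstream bandwidth divided by total upstream bandwidth."""
    return (downstream_ports * PORT_GBS) / (upstream_ports * PORT_GBS)


def host_share_gbs(active_gpus: int, upstream_ports: int = 1) -> float:
    """Host bandwidth each GPU gets when `active_gpus` hit the host at once."""
    if active_gpus < 1:
        raise ValueError("need at least one active GPU")
    # A single GPU can never exceed its own x16 link.
    return min(PORT_GBS, upstream_ports * PORT_GBS / active_gpus)


for n in (1, 2, 4, 8):
    print(f"{n} active GPU(s): {host_share_gbs(n):.0f} GB/s each to the host")
```

The worst case only applies to GPU-to-host traffic. Peer-to-peer traffic between GPUs on the same switch does not consume the upstream link.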
Retimers are different: signal repeaters that let a Gen5 link run over a cable longer than spec allows. They don't change topology — they make the chosen topology physically achievable. Every 8-GPU EPYC chassis Kentino ships uses retimers because the cable runs from motherboard to GPU bays exceed Gen5's native reach.
NVLink — what it is and where it isn't
NVLink is NVIDIA's proprietary GPU-to-GPU interconnect, separate from PCIe. It uses a dedicated set of high-speed lanes on the GPU edge (or routed through an SXM connector or an NVLink bridge) to provide direct memory access between GPUs at much higher bandwidth than PCIe.
| Interconnect | Aggregate bandwidth | Cards that support it (in 2026) |
|---|---|---|
| PCIe Gen5 x16 | 64 GB/s per direction (128 GB/s bidirectional) | Gen5 PCIe GPUs (RTX 5090, RTX Pro 6000 Blackwell) |
| NVLink bridge (NVLink 3 / 4) | 600 GB/s | A100, H100 PCIe variants (mostly retired) |
| NVLink 4 (SXM) | 900 GB/s | H100 SXM, H200 (SXM-only) |
| NVLink 5 (SXM) | 1800 GB/s | B200, GB200 (SXM-only) |
| NVLink on RTX Pro 6000 Blackwell | None | PCIe-only card, no NVLink |
The key fact for any build Kentino ships: none of our cards have NVLink. The RTX 4090 dropped the connector the 3090 had. The 5090 doesn't have it. The RTX Pro 6000 Blackwell (both Workstation and Max-Q) is PCIe-only. L40 and L4 likewise.
This isn't an oversight. NVIDIA reserves NVLink for SXM datacenter GPUs and the few PCIe cards with an NVLink bridge — and those are being phased out as the high end moves fully to SXM. If you want NVLink, you're buying HGX with H100/H200/B200 modules at ten times the price, and Kentino doesn't build that.
Without NVLink, GPU-to-GPU collectives (all-reduce, all-gather, reduce-scatter) go over PCIe peer-to-peer. Effective bandwidth between any two cards is bottlenecked by the slower of the two PCIe links and whatever switch or root port sits between them. On an 8-GPU EPYC system, P2P between GPUs on the same switch is fast; P2P across root complexes goes through the CPU and is slower.
For inference this almost never matters — inference is memory-bound on the local GPU, with batched activations only occasionally crossing GPUs. For training with tensor parallelism, this is the single biggest reason an 8×5090 EPYC build is not equivalent to an 8×H100 HGX node, even when the raw FLOPS look comparable.
When PCIe bandwidth actually saturates
| Workload | Saturates PCIe? | Notes |
|---|---|---|
| Single-GPU inference (LLM, batch 1) | No | Model lives in VRAM; PCIe only for tokens |
| Single-GPU inference (LLM, large batch) | No | Throughput rises with batch; PCIe still idle |
| Single-GPU vision inference | Sometimes | If feeding from CPU memory, x8 noticeable |
| Multi-GPU inference (tensor parallel) | Sometimes | Activations cross GPUs every layer |
| Multi-GPU inference (pipeline parallel) | Rarely | Only activations at stage boundaries |
| Model loading from NVMe / network | Yes | A ~400 GB Llama-405B Q8 checkpoint (about one byte per parameter) wants every GB/s you have |
| Training, single GPU | No | Same as inference |
| Training, multi-GPU, ZeRO-1/2 | Yes | Gradient all-reduce hammers the link |
| Training, multi-GPU, ZeRO-3 / FSDP | Yes, hard | Parameter all-gather on every forward and backward pass |
| Training, multi-GPU, tensor parallel | Yes, hard | Without NVLink, this is the worst case |
The pattern is consistent: inference rarely saturates PCIe; multi-GPU training does. If the build will spend its life serving inference, which is true for most buyers, x8 Gen5 per card is fine, and you can pack more GPUs into less topology budget. If you're training, you want every card at x16 and the GPUs grouped so collectives don't traverse the slowest path.
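To put a rough number on what link width means for gradient sync, here is a back-of-envelope sketch using the standard ring all-reduce traffic volume: each GPU moves 2(N-1)/N times the payload. It assumes the transfer is purely bandwidth-bound and that PCIe peer-to-peer reaches the nominal link rate, neither of which holds exactly in practice. The 14 GB payload (a hypothetical 7B-parameter model with 16-bit gradients) is illustrative only.

```python
# Back-of-envelope ring all-reduce time over PCIe.
# Assumptions: bandwidth-bound (latency ignored), every GPU link sustains its
# nominal per-direction rate, and the collective uses a ring algorithm.


def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gbs: float) -> float:
    """Estimate seconds for one all-reduce of `payload_gb` across `n_gpus`."""
    if link_gbs <= 0:
        raise ValueError("link bandwidth must be positive")
    if n_gpus < 2:
        return 0.0  # nothing to synchronize
    # Each GPU sends (and receives) 2 * (N - 1) / N of the payload in total.
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gbs


GRADIENTS_GB = 14.0  # hypothetical: 7B parameters x 2 bytes each

for label, gbs in (("Gen5 x16", 64.0), ("Gen5 x8", 32.0), ("Gen3 x8", 8.0)):
    t = ring_allreduce_seconds(GRADIENTS_GB, n_gpus=8, link_gbs=gbs)
    print(f"{label}: {t:.2f} s per all-reduce")
```

Because the estimate is linear in link bandwidth, halving the link width doubles the sync time. Whether that matters depends on how much of each training step the collective can overlap with compute.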
Topology for a 4-GPU build (EPYC Genoa, single socket)
This is the standard Kentino 4-GPU configuration on AMD EPYC. It also works on Threadripper Pro 7000 WX with identical lane allocation.
Figure: 4-GPU EPYC topology. Each GPU gets a dedicated x16 Gen5 link directly to the CPU root complex, with no switch and no oversubscription.
Each GPU gets a full x16 Gen5 link straight to the CPU root complex. No bifurcation, no switch, no retimer for the GPUs themselves (risers may still have retimers depending on chassis layout). P2P between any two GPUs goes through EPYC's internal fabric and is symmetric — all four cards are equidistant in topology terms.
This is the cleanest multi-GPU build available. It's what we ship for 4× RTX 5090, 4× RTX 4090, 4× RTX Pro 6000 Blackwell, and 4× L40.
Topology for an 8-GPU build (EPYC Genoa / Turin, single socket)
128 lanes is exactly eight x16 links (8 × 16 = 128), so giving every GPU a native x16 would consume the entire budget with nothing left for NVMe or networking. The standard layouts are:
Option A: All 8 GPUs at x16, with a switch fabric
Figure (Option A): two PCIe switches, each with one x16 upstream link to the CPU and four x16 downstream GPUs. Intra-switch P2P is fast; cross-switch P2P traverses the CPU.
Each switch upstream-connects to the CPU at x16 Gen5 and fans out four x16 downstream ports. GPUs 0–3 share a 64 GB/s upstream link to the CPU; GPUs 4–7 share another. P2P between GPUs 0 and 1 is fast (same switch); P2P between GPU 0 and GPU 4 traverses the CPU root complex and is slower. Kentino's 8-GPU builds on Supermicro and Bone64c chassis follow this model.
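The same-switch versus cross-switch distinction can be modelled in a few lines. This is a sketch of the layout described above (GPUs 0 to 3 on one switch, GPUs 4 to 7 on the other); the names are illustrative and not a vendor API. On real hardware, `nvidia-smi topo -m` prints the actual connectivity matrix.

```python
# Classify the PCIe path between two GPUs in the Option A layout.
# Assumption: GPUs 0-3 hang off switch 0 and GPUs 4-7 off switch 1.
from itertools import combinations

GPUS_PER_SWITCH = 4
NUM_GPUS = 8


def switch_of(gpu: int) -> int:
    """Return the index of the switch a GPU hangs off."""
    if not 0 <= gpu < NUM_GPUS:
        raise ValueError(f"no such GPU: {gpu}")
    return gpu // GPUS_PER_SWITCH


def p2p_path(a: int, b: int) -> str:
    """Classify the route that peer-to-peer traffic between a and b takes."""
    if switch_of(a) == switch_of(b):
        return "same-switch"   # stays inside one switch, never touches the CPU
    return "cross-switch"      # up one switch, through the CPU, down the other


pairs = list(combinations(range(NUM_GPUS), 2))
slow = [p for p in pairs if p2p_path(*p) == "cross-switch"]
print(f"{len(slow)} of {len(pairs)} GPU pairs cross the CPU root complex")
```

In this layout 16 of the 28 GPU pairs take the slower path, which is why a four-way tensor-parallel group is better placed entirely on one switch than spread across both.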
Option B: All 8 GPUs at x8, direct to CPU
Figure (Option B): eight GPUs at x8, direct to the CPU via bifurcation. No switch, no oversubscription, 32 GB/s per card. For inference-only builds.
No switch, no oversubscription, lower CPU-to-GPU latency. Each card gets 32 GB/s instead of 64. For inference this is invisible. For training under heavy collective comms it's meaningfully worse than Option A — the per-card link is smaller and there's no fast intra-switch P2P.
Kentino's default 8-GPU build is Option A (switched fabric) for training-capable systems and Option B (bifurcated direct) for inference-only builds where the lane budget is better spent on NVMe and dual 100 GbE NICs.
Signal integrity at Gen5
Gen5 is fast enough that the physical layer matters in a way it didn't at Gen3 or Gen4. A Gen5 trace runs ~7 inches on standard FR4 PCB before the eye closes. A Gen5 cable runs ~20 cm without a retimer. That's enough for a slot adjacent to the CPU; it's not enough for a riser cable in a 4U chassis 40 cm away.
What this means:
- Risers matter. Gen4 risers won't pass Gen5 signals. You need Gen5-rated risers, usually with a retimer inline. The cost difference is real — €80–€150 per riser.
- Cable length is hard-bounded. Beyond roughly 20–30 cm a cable needs a retimer; beyond about 70 cm it needs two. This is why "external GPU box" products at Gen5 don't really exist.
- A flaky link will silently downtrain to Gen4 or Gen3. The system boots, the GPU appears in `nvidia-smi`, inference runs. Bandwidth is half (Gen4) or a quarter (Gen3) of what you paid for. `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv` is the first thing to check on a new build.
- BIOS matters. Some Supermicro and ASRock Rack boards default to "Auto" link speed and pick conservatively. Force Gen5 explicitly and confirm post-boot.
We've seen builds where one of eight GPUs trained at Gen3 x8 for weeks because nobody looked. The job ran. It was 4× slower than it should have been.
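The link check above is easy to script so that it runs on every boot rather than when someone remembers. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and returns numeric values for every GPU; the function names and the sample output are hypothetical. Many GPUs also drop their link generation at idle to save power, so the reading is most meaningful while the cards are under load.

```python
# Flag GPUs whose PCIe link is slower or narrower than expected.
import subprocess

# Fields match the nvidia-smi query quoted above, plus the GPU index.
QUERY_FIELDS = "index,pcie.link.gen.current,pcie.link.width.current"


def query_links() -> str:
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_links(csv_text: str) -> list[dict[str, int]]:
    """Turn 'index, gen, width' lines into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, gen, width = (int(field.strip()) for field in line.split(","))
        rows.append({"index": index, "gen": gen, "width": width})
    return rows


def downtrained(rows, expected_gen: int = 5, expected_width: int = 16):
    """Return the GPUs whose link is below the expected generation or width."""
    return [r for r in rows if r["gen"] < expected_gen or r["width"] < expected_width]


# Sample output from a hypothetical 3-GPU box; GPU 2 has fallen back to Gen3 x8.
SAMPLE = "0, 5, 16\n1, 5, 16\n2, 3, 8\n"
for gpu in downtrained(parse_links(SAMPLE)):
    print(f"GPU {gpu['index']}: Gen{gpu['gen']} x{gpu['width']} (expected Gen5 x16)")
```

On a live system, replace `SAMPLE` with `query_links()`, and set `expected_gen` and `expected_width` to what the build was specified for (for example Gen4 for Ada-generation cards, or x8 for a bifurcated layout).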
When topology matters and when it doesn't
Mostly doesn't matter if: you're running single-GPU inference at any model size; multi-GPU inference with replica parallelism (each GPU runs its own copy); batch inference dominated by VRAM bandwidth; or loading models once and serving for hours.
Matters a lot if: you're training multi-GPU with the model split across cards (tensor or pipeline parallel); doing FSDP or ZeRO-3 where parameters are sharded and re-gathered every step; running RLHF or other workloads with frequent gradient sync; hot-swapping models during operation; or feeding GPUs from a remote storage tier where PCIe is the funnel.
For most buyers — inference servers for LLM/VLM serving, robotics backends, AI server hosting — topology is a check-the-link problem, not an architectural one. For the few doing serious multi-GPU training, topology (switched vs bifurcated, NVLink vs not) is the architectural decision, and it's why an 8×5090 EPYC build is the right tool for some training jobs and the wrong tool for others.
What to do next
If you're speccing a multi-GPU build, the questions to answer before settling on a CPU:
- How many GPUs, and at what lane width? 4 × x16 fits on a workstation; 8 × x16 needs EPYC + switches; 8 × x8 needs EPYC + bifurcation.
- Inference or training? Inference only: x8 per GPU is fine, save the switch cost. Training with tensor parallel: x16 and a switch, full stop.
- What else needs lanes? A 100 GbE NIC takes x16. Four U.2 NVMe drives take x16. Plan before committing.
- What's your Gen5 riser story? Budget €100/riser, confirm Gen5-rated with retimers as needed. See W03 for the riser detail.
- Are you certain you don't need NVLink? If your training workload is bound at the interconnect, no PCIe topology will rescue you. That's when the conversation moves to HGX-class hardware — which Kentino doesn't build. Better to know up front than after racking.
After the system is built and benchmarked once, PCIe topology is something you forget about for three years. The moments to pay attention: day one (verify every GPU's link speed), after any BIOS update (link training settings reset), after any physical reseat (cables wiggle, links downtrain), and before any training run that costs real money.
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.