NVLink and NVSwitch: When It Matters, and Why It Usually Doesn't for Kentino's Lineup

A recurring inbox question: a customer sizing a 4× or 8× GPU server sees NVIDIA's DGX marketing brag about NVLink bandwidth in the terabytes per second and asks whether the Kentino build "has NVLink." The honest answer is no, none of our builds do — and for the workloads the customer actually has, that is fine. This article unpacks why.

NVLink is genuinely impressive at the top of the lineup and absent everywhere else. The marketing does not draw a clear line, so buyers either over-pay for a fabric they do not need or under-buy thinking PCIe is a step-function downgrade across the board. Neither is true. The line is sharp and sits in a specific place.

What NVLink actually is

NVLink is a point-to-point, high-bandwidth GPU-to-GPU interconnect that bypasses the host's PCIe root complex. Two GPUs with an NVLink connection move tensors directly across the link without bouncing through CPU memory and without contending with anything else on the PCIe tree. That is the whole pitch.

The bandwidth advantage over PCIe is substantial. PCIe Gen5 x16 — the current ceiling for a consumer or workstation slot — gives about 64 GB/s in each direction, 128 GB/s aggregate. NVLink 5 on B200 and GB200 gives 1.8 TB/s aggregate per GPU, roughly 14× a PCIe Gen5 x16 slot.

This comparison is misleading the moment you write it down, because GPUs with NVLink 5 are not GPUs with PCIe Gen5 x16 as their primary interconnect. NVLink lives on datacenter SKUs (A100, H100, H200, B200, GB200); PCIe is the only path on consumer and workstation SKUs (4090, 5090, RTX Pro 6000 Blackwell, L40, L4). "NVLink versus PCIe" in practice means "the H100 line versus the rest."

NVLink generations at a glance

Generation	GPU	Links per GPU	Aggregate per GPU	Year
NVLink 2	V100 (Volta)	6	300 GB/s	2017
NVLink 3	A100 (Ampere)	12	600 GB/s	2020
NVLink 4	H100 / H200 (Hopper)	18	900 GB/s	2022
NVLink 5	B200 / GB200 (Blackwell DC)	18	1.8 TB/s	2024

Link count grew from generation 2 to 4, then per-link bandwidth doubled from gen 4 to 5 (50 GB/s to 100 GB/s). That is why NVLink 5 looks like a step-change — it is.
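The arithmetic behind the table can be checked in a few lines. This is a minimal sketch: the per-link figures for NVLink 2 and 3 are not stated directly in the text and are derived here by dividing each aggregate by its link count.

```python
# Links per GPU and per-link bandwidth (GB/s, bidirectional).
# Per-link values for NVLink 2 and 3 are derived from the table
# (aggregate / link count); treat them as an assumption of this sketch.
NVLINK_GENERATIONS = {
    "NVLink 2 (V100)": (6, 50),
    "NVLink 3 (A100)": (12, 50),
    "NVLink 4 (H100/H200)": (18, 50),
    "NVLink 5 (B200/GB200)": (18, 100),
}


def aggregate_gb_s(links: int, per_link_gb_s: int) -> int:
    """Aggregate per-GPU NVLink bandwidth: link count times per-link bandwidth."""
    return links * per_link_gb_s


for name, (links, per_link) in NVLINK_GENERATIONS.items():
    total = aggregate_gb_s(links, per_link)
    print(f"{name}: {links} links x {per_link} GB/s = {total} GB/s")
```

Generations 2 through 4 grow only in the first factor; generation 5 keeps 18 links and doubles the second.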

PCIe in a Kentino build:

Standard	Per-direction x16	Aggregate x16
PCIe Gen4 x16	32 GB/s	64 GB/s
PCIe Gen5 x16	64 GB/s	128 GB/s

Comparing aggregate figures: PCIe Gen4 x16 (64 GB/s) is about 1/14 of NVLink 4 (900 GB/s), and PCIe Gen5 x16 (128 GB/s) is about 1/14 of NVLink 5 (1.8 TB/s). The ratio has stayed roughly constant across the two generations.
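The ratio can be reproduced from the two tables. A small sketch, using aggregate (both-direction) figures on both sides so the comparison is like for like:

```python
# Aggregate bandwidth in GB/s, taken from the tables above.
PCIE_X16_AGGREGATE = {"Gen4": 64, "Gen5": 128}
NVLINK_AGGREGATE = {"NVLink 4": 900, "NVLink 5": 1800}


def bandwidth_ratio(nvlink_gb_s: float, pcie_gb_s: float) -> float:
    """How many times more aggregate bandwidth NVLink offers than a PCIe x16 slot."""
    return nvlink_gb_s / pcie_gb_s


# Pair each NVLink generation with the PCIe generation of the same era.
for nvlink, pcie in (("NVLink 4", "Gen4"), ("NVLink 5", "Gen5")):
    r = bandwidth_ratio(NVLINK_AGGREGATE[nvlink], PCIE_X16_AGGREGATE[pcie])
    print(f"{nvlink} vs PCIe {pcie} x16: {r:.1f}x")
# prints 14.1x for both pairs
```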

The honest part: Kentino's lineup does not have NVLink

GPU	Form factor	NVLink?
RTX 4090	PCIe	No
RTX 5090	PCIe	No
RTX Pro 6000 Blackwell (WS/Server/Max-Q)	PCIe	No
L40 / L40S	PCIe	No
L4	PCIe	No
Intel Arc Pro B70	PCIe	n/a

NVIDIA removed the NVLink finger from consumer GeForce starting with Ada Lovelace. The 3090 was the last consumer card with a working bridge; the 4090 dropped it and the 5090 has none. The stated reason was "users want bandwidth inside a single GPU, not between two of them" — which conveniently aligned with training customers paying datacenter prices for inter-GPU bandwidth.

The interesting case is the RTX Pro 6000 Blackwell: a 96 GB workstation-and-server card from the same Blackwell architecture generation as B200 (though built on a different die), and the obvious "serious GPU memory without going to B200" pick. It also has no NVLink. Not on workstation, not on server, not on Max-Q. No bridge connector on the PCB. NVIDIA's datasheets list NVLink as not supported across all three SKUs.

This is the deliberate segmentation line. NVLink means stepping up to H100, H200, B200, or GB200 — SXM form factor, HGX baseboard, different chassis, different cooling, allocation Kentino does not have. If you genuinely need NVLink, talk to an HGX-system vendor.

What you lose without NVLink

The penalty shows up in two specific workload patterns:

Tensor parallelism across GPUs. When a model is too big for one GPU and you split each layer's weight matrix across cards, every transformer layer requires an AllReduce across the shards. AllReduce is bandwidth-heavy and latency-sensitive. PCIe is the bottleneck.
Distributed training with fine-grained gradient sync. DDP, FSDP, and Megatron-style training synchronise gradients across GPUs every step (AllReduce under DDP, reduce-scatter plus all-gather under FSDP). The smaller the per-step compute and the larger the model, the more the interconnect dominates wall-clock.

Everything else runs fine on PCIe: single-GPU inference, pipeline parallelism, data-parallel inference (independent replicas), embeddings, vision inference, ASR, TTS, diffusion image generation, and fine-tuning a model that fits on one GPU. For these, NVLink is irrelevant.

Measured TP scaling efficiency for a 70B-class LLM at INT4/INT8, from published 3090/4090/L40S benchmarks (read each figure as achieved speedup divided by GPU count, so 1.0 is perfectly linear):

Configuration	TP scaling	Notes
2× GPU, NVLink (3090 + bridge)	~0.90–0.95	Near-linear
2× GPU, PCIe Gen4	~0.60–0.70	Significant AllReduce loss
2× GPU, PCIe Gen5	~0.65–0.75	Better, still bottlenecked
4× GPU, PCIe Gen5	~0.50–0.65	AllReduce cost grows
8× GPU, PCIe Gen5	~0.40–0.55	TP becomes painful

Read these as ranges, not promises: exact numbers depend on model, batch size, sequence length, quantization, NUMA topology, and slot placement. The shape is real: PCIe tensor parallel scales sub-linearly and the penalty grows with GPU count. This is consistent with vLLM's documentation, which suggests pipeline parallelism instead of tensor parallelism on GPUs without NVLink.
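Why the penalty grows with GPU count can be seen in a bandwidth-only cost model. This is a rough sketch under stated assumptions, not a reproduction of the benchmark figures: it assumes a ring AllReduce, no overlap of compute and communication, and no latency term. The step time and message size are illustrative placeholders, not measurements. Real PCIe systems usually do worse than this model, because per-message latency and root-complex contention are ignored here.

```python
def ring_allreduce_seconds(message_bytes: float, n_gpus: int, link_gb_s: float) -> float:
    """Bandwidth-only time for a ring AllReduce.

    Each GPU sends (and receives) 2 * (N - 1) / N times the message size.
    Latency is ignored, so this is a lower bound.
    """
    if n_gpus < 2:
        return 0.0
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_bytes / (link_gb_s * 1e9)


def scaling_efficiency(compute_s_one_gpu: float, comm_s: float, n_gpus: int) -> float:
    """Achieved speedup divided by GPU count, with no compute/communication overlap."""
    step_s = compute_s_one_gpu / n_gpus + comm_s
    return (compute_s_one_gpu / step_s) / n_gpus


# Illustrative inputs only (assumptions of this sketch).
COMPUTE_S = 0.040       # time for one step on a single GPU
MESSAGE_BYTES = 64e6    # total bytes AllReduced per step
LINKS_GB_S = {"PCIe Gen5 x16 (one direction)": 64, "NVLink 4 (one direction)": 450}

for n in (2, 4, 8):
    for name, bw in LINKS_GB_S.items():
        comm = ring_allreduce_seconds(MESSAGE_BYTES, n, bw)
        eff = scaling_efficiency(COMPUTE_S, comm, n)
        print(f"{n} GPUs over {name}: efficiency {eff:.2f}")
```

The compute term shrinks as 1/N while the communication term does not, so efficiency falls as GPUs are added, and it falls faster on the slower link.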

The practical replacement: keep the model on one GPU

The under-appreciated fact about the current GPU landscape: an RTX Pro 6000 Blackwell has 96 GB of VRAM on a single card — enough to host a 70B at INT4 or INT8 in one GPU with KV cache room. If you avoid splitting a model across GPUs at all, NVLink is moot.

Model	Quant	VRAM	One Pro 6000?
7B / 8B	INT4	~5 GB	Yes, many copies
13B	INT4	~9 GB	Yes, many copies
32B	INT4	~20 GB	Yes, 4× concurrent
70B (Llama 3.3, Qwen)	INT4	~42 GB	Yes, plus KV cache
70B	INT8	~75 GB	Yes, tight
Qwen2.5-VL 72B	INT4	~48 GB	Yes
405B (Llama 3.1)	INT4	~240 GB	No — 3 cards
Mixtral 8×22B	INT4	~80 GB	Tight, one card
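The VRAM column can be approximated from parameter count and bits per weight. A minimal sketch: the weights-only figure is exact arithmetic, while the 20% overhead factor for runtime buffers and a modest KV cache is an assumption of this example. The table's own figures imply somewhat different overheads per row.

```python
import math


def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimated_gb(params_billion: float, bits_per_weight: int,
                 overhead_fraction: float = 0.2) -> float:
    """Weights plus an assumed overhead for runtime buffers and KV cache."""
    return weights_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


def cards_needed(params_billion: float, bits_per_weight: int,
                 card_gb: float = 96.0, overhead_fraction: float = 0.2) -> int:
    """Minimum number of cards whose combined VRAM covers the estimate."""
    return math.ceil(estimated_gb(params_billion, bits_per_weight, overhead_fraction) / card_gb)


for label, params, bits in (("70B INT4", 70, 4), ("70B INT8", 70, 8), ("405B INT4", 405, 4)):
    print(f"{label}: ~{estimated_gb(params, bits):.0f} GB, "
          f"{cards_needed(params, bits)} x 96 GB card(s)")
```

Long contexts or high concurrency push KV cache well beyond the assumed overhead, so treat the result as a floor for sizing rather than a guarantee.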

Single-card hosting is the right architecture for almost every model worth serving in 2026. Exceptions: very large dense models (405B-class) and MoE models whose active expert set fits one card but whose full weight set does not.

For multi-card on PCIe, the right choice is pipeline parallelism, not tensor parallelism. Pipeline parallel splits layers in long contiguous blocks (GPU 0 holds layers 0–39, GPU 1 holds 40–79, etc.). Inter-GPU traffic is just the activation tensor at each block boundary: hidden size × bytes per value for each token, which is 16 KB per token for an 8192-wide model at FP16. That transfer happens once per boundary, instead of an AllReduce inside every layer.
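The contiguous-block split and the size of the boundary transfer can be sketched as follows. The 80-layer count matches the example in the text. The hidden size of 8192 and the FP16 activation format are assumptions of this sketch, typical of 70B-class models but not stated in the article.

```python
def split_layers(n_layers: int, n_stages: int) -> list[range]:
    """Assign contiguous blocks of layers to pipeline stages; sizes differ by at most one."""
    base, extra = divmod(n_layers, n_stages)
    blocks, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def boundary_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Activation payload crossing one stage boundary for one token."""
    return hidden_size * bytes_per_value


for gpu, block in enumerate(split_layers(80, 2)):
    print(f"GPU {gpu}: layers {block.start}-{block.stop - 1}")

# Assumed hidden size of 8192 at FP16 (2 bytes per value).
print(f"{boundary_bytes_per_token(8192)} bytes per token per boundary")
```

With two stages there is a single boundary, so each token's activations cross PCIe once per forward pass.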

Parallelism mode	Inter-GPU traffic per layer	Sensitive to interconnect?
Tensor parallel	AllReduce of activations (tokens × hidden dim), every layer	Yes (wants NVLink)
Pipeline parallel	Activation at block boundaries only	No (PCIe is fine)
Data parallel	Gradients at step boundary (training only)	Moderate
Expert parallel (MoE)	All-to-all on expert routing	Yes (NVLink helps)

On an 8× 5090 server serving a 70B, you do not split the model across all eight cards. You run two instances with 4-way pipeline parallel, or four instances with 2-way pipeline, or — most commonly — eight independent instances of a smaller model behind a load balancer. The 8× server becomes an eight-replica throughput multiplier rather than one huge virtual GPU. For production inference, the replica architecture is usually the right answer regardless of NVLink availability: more concurrency, graceful degradation when a card fails.
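The replica layouts described above can be generated mechanically. This is a sketch that only builds the launch commands and does not run them. It assumes vLLM's `vllm serve` CLI with its `--pipeline-parallel-size` and `--port` options, so check the flags against the version you have installed. The model name and base port are hypothetical placeholders.

```python
def replica_plan(n_gpus: int, pipeline_size: int, model: str, base_port: int = 8000):
    """Return (env, command) pairs, one per replica, each pinned to its own GPUs."""
    if pipeline_size < 1 or n_gpus % pipeline_size != 0:
        raise ValueError("pipeline_size must divide n_gpus evenly")
    plans = []
    for replica in range(n_gpus // pipeline_size):
        first = replica * pipeline_size
        gpu_ids = ",".join(str(g) for g in range(first, first + pipeline_size))
        env = {"CUDA_VISIBLE_DEVICES": gpu_ids}  # restrict the replica to its GPUs
        cmd = [
            "vllm", "serve", model,
            "--pipeline-parallel-size", str(pipeline_size),
            "--port", str(base_port + replica),  # one port per replica for the load balancer
        ]
        plans.append((env, cmd))
    return plans


# Hypothetical model identifier; two replicas with 4-way pipeline parallel on 8 GPUs.
for env, cmd in replica_plan(8, 4, "your-org/your-70b-int4"):
    print(env["CUDA_VISIBLE_DEVICES"], " ".join(cmd))
```

Each command would be started with its matching environment (for example via `subprocess.Popen(cmd, env={**os.environ, **env})`), and the resulting ports placed behind whatever load balancer you already run.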

When NVLink genuinely matters

Workloads where the absence of NVLink is a real problem, not a marketing problem:

Training a model that does not fit on one GPU. Pre-training or full fine-tuning of a 70B+ dense model requires the model split across GPUs with gradient AllReduces every step. NVLink is the difference between a productive 8-GPU rig and eight cards mostly waiting on the bus.
Tensor parallel inference on very large dense models. If you need 405B served across GPUs and cannot accept pipeline-parallel latency-per-token, NVLink matters.
MoE with cross-GPU expert routing. MoE all-to-all is brutal on PCIe. DeepSeek-V3, Mixtral 8×22B and similar sparse MoE designs benefit clearly once their experts are spread across GPUs.
High-frequency RLHF / GRPO loops. Policy/reference sync repeated thousands of times per epoch hits the same AllReduce cost.
Multi-GPU diffusion training at scale. Some larger video diffusion models have tensor-parallel-like activation patterns.

If your workload is on this list, do not buy a Kentino 8× 5090 server and expect DGX H100 behaviour. Buy an HGX system, or rent H100/B200 in the cloud for the training phase and bring weights back on-prem for inference. That is a perfectly reasonable workflow and one we recommend openly.

NVSwitch: the chassis-level fabric

NVLink is point-to-point — GPU A to GPU B over a bundle of links. Above two GPUs in a chassis, you either give each pair its own dedicated NVLink (doesn't scale past four) or put an NVLink switch in the middle. NVIDIA's NVSwitch is that switch.

On an HGX H100 8-GPU baseboard, four NVSwitch chips give every GPU full-bandwidth NVLink 4 to every other GPU — 900 GB/s, all-to-all, no contention. On a GB200 NVL72 rack, NVSwitch scales across 72 GPUs in a single non-blocking topology, 1.8 TB/s per GPU, 130 TB/s aggregated. NVSwitch is what makes "one big virtual GPU" actually work; without it NVLink is just a faster pairwise cable.
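A back-of-envelope sketch of why a switch is needed. It assumes a hypothetical switchless layout in which each GPU divides its NVLink 4 links evenly among all peers. Real switchless topologies make different trade-offs, so this only illustrates the scaling problem. It also checks the NVL72 aggregate figure.

```python
def full_mesh_per_peer_gb_s(n_gpus: int, links_per_gpu: int = 18,
                            per_link_gb_s: int = 50) -> int:
    """Bandwidth to each peer when links are split evenly with no switch."""
    links_per_peer = links_per_gpu // (n_gpus - 1)  # whole links only
    return links_per_peer * per_link_gb_s


SWITCHED_GB_S = 18 * 50  # through NVSwitch, every link can be aimed at any one peer

for n in (2, 4, 8):
    print(f"{n} GPUs: {full_mesh_per_peer_gb_s(n)} GB/s per peer direct, "
          f"{SWITCHED_GB_S} GB/s per peer via NVSwitch")

# GB200 NVL72: 72 GPUs at 1.8 TB/s each (computed in GB/s to stay in integers).
nvl72_total_gb_s = 72 * 1800
print(f"NVL72 aggregate: {nvl72_total_gb_s / 1000:.1f} TB/s")
```

At eight GPUs the direct layout leaves only two links per peer, while the switched fabric keeps the full 900 GB/s available to whichever peer a GPU is talking to. The NVL72 product lands at roughly the 130 TB/s quoted above.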

Practical:

No NVSwitch in any Kentino build. NVSwitch ships only inside NVIDIA-certified HGX and DGX. No aftermarket chip drops into a Supermicro or Bone64c chassis.
No NVSwitch in any RTX card, ever. Datacenter-only.
GB200 NVL72 is rack-scale, not server-scale. 72 GPUs cooperate through copper-cabled NVLink at backplane speeds. Cables, switches, backplane all NVIDIA proprietary. List price runs to millions of US dollars with multi-quarter lead times. The high end of what NVLink enables in 2026. Not for us.

Cost and availability

NVLink-capable systems live in their own pricing tier. Approximate mid-2026 market, US/EU:

System class	GPUs	List price band	Lead time
4× RTX 5090 (Kentino-class)	4	€25k–€40k	2–4 weeks
8× RTX 5090 (Kentino-class)	8	€50k–€80k	3–6 weeks
4× RTX Pro 6000 Blackwell	4	€60k–€90k	3–6 weeks
8× RTX Pro 6000 Blackwell	8	€120k–€180k	4–8 weeks
HGX H100 SXM (8× H100, NVSwitch)	8	€250k–€350k	8–16 weeks
HGX B200 SXM (8× B200, NVSwitch)	8	€400k–€550k	12–24 weeks
GB200 NVL72 (72× B200)	72	€3M–€4M+	6–12 months

The price gap between a Kentino 8× Pro 6000 build and an HGX H100 is roughly 2× for the same nominal GPU count. The performance gap for non-NVLink-dependent workloads is much smaller than 2×. For NVLink-dependent work (large-model training, tensor parallel on 405B) the H100 box is the right tool and the price is justified. Rule of thumb: if your workload fits on one 96 GB GPU, the Pro 6000 build saves roughly half the budget. If it does not, pay for NVLink.
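The 2× figure follows from the midpoints of the two price bands in the table. A small sketch of that arithmetic. Using band midpoints is a simplification of this example, and the bands themselves are the approximate figures quoted above.

```python
# Price bands in thousands of euros, copied from the table above.
SYSTEMS = {
    "8x RTX Pro 6000 Blackwell": {"gpus": 8, "low": 120, "high": 180},
    "HGX H100 SXM": {"gpus": 8, "low": 250, "high": 350},
}


def midpoint_k_eur(system: dict) -> float:
    """Midpoint of the list price band, in thousands of euros."""
    return (system["low"] + system["high"]) / 2


def per_gpu_k_eur(system: dict) -> float:
    """Midpoint price divided by GPU count."""
    return midpoint_k_eur(system) / system["gpus"]


pro = SYSTEMS["8x RTX Pro 6000 Blackwell"]
hgx = SYSTEMS["HGX H100 SXM"]
ratio = midpoint_k_eur(hgx) / midpoint_k_eur(pro)
saving = 1 - midpoint_k_eur(pro) / midpoint_k_eur(hgx)

print(f"Per GPU: Pro 6000 ~{per_gpu_k_eur(pro):.2f}k EUR, H100 ~{per_gpu_k_eur(hgx):.2f}k EUR")
print(f"System price ratio: {ratio:.1f}x, saving: {saving:.0%}")
```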

Summary

Question	Kentino lineup answer
Any current card with NVLink?	No
Any current build with NVSwitch?	No
Tensor-parallel a 70B?	Yes, at ~0.6–0.75 scaling efficiency on 2 GPUs over PCIe (lower at 4–8)
Pipeline-parallel a 70B?	Yes, near-linear
Fit a 70B on one card?	Yes — RTX Pro 6000 Blackwell, 96 GB
Train a 70B from scratch?	Not efficiently — go cloud or HGX
Serve 405B dense?	Only pipeline-parallel across 3+ Pro 6000s
MoE at scale?	Smaller MoE yes; DeepSeek-class no
Build a DGX equivalent?	No

What to do next

If you are sizing a system and unsure whether you need NVLink, work the problem in this order:

Write down the largest model you need to serve, with quantization. If it fits on one GPU, NVLink is irrelevant. Stop.
If it does not fit, ask whether pipeline parallel is acceptable. Pipeline adds latency-per-token but throughput is fine. For batch inference and most chat workloads, that is acceptable.
If pipeline parallel is not acceptable (you need minimum single-stream latency on a very large model), you need tensor parallel. On PCIe you pay a tax of roughly 25–50% of ideal scaling at 2–4 GPUs, and more at 8. If that tax breaks your economics, NVLink is worth the system upgrade.
If you are training, the answer is almost always NVLink. Training dense models above 13B on PCIe is a bad use of GPU-hours. Rent NVLink in the cloud or buy HGX.
For inference, a single-card Pro 6000 Blackwell or a multi-replica 4×/8× 5090 is usually the right answer. This is what most of our customers buy, and it works.
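The checklist above can be written as a small decision function. This is one possible encoding, not a definitive sizing tool. It checks the training question right after the single-GPU question, on the reading that the pipeline and tensor-parallel steps concern serving. The parameter names and returned strings are illustrative.

```python
def recommend(fits_on_one_gpu: bool, training: bool,
              pipeline_latency_ok: bool, tp_tax_ok: bool) -> str:
    """Walk the sizing questions in order and return the first answer that applies."""
    if fits_on_one_gpu:
        return "single GPU or replicas on PCIe; NVLink irrelevant"
    if training:
        return "NVLink: rent in the cloud or buy HGX"
    if pipeline_latency_ok:
        return "pipeline parallel on PCIe"
    if tp_tax_ok:
        return "tensor parallel on PCIe, accepting the scaling tax"
    return "NVLink: tensor parallel on an HGX-class system"


# Serving a model too large for one card, where added per-token latency is acceptable.
print(recommend(fits_on_one_gpu=False, training=False,
                pipeline_latency_ok=True, tp_tax_ok=False))
```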

NVLink is not bad. It is excellent at what it does. NVIDIA has drawn a hard segmentation line, and below that line the right architectural response is "host smaller models, replicate horizontally, use pipeline parallel when you must split." That is what the Kentino lineup is built for.

Follow-ups: InfiniBand and RoCE for cluster-scale interconnect (N02), switched cluster topologies (N04), and PCIe-as-interconnect for small clusters (K07).

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
