Single-Node Multi-GPU vs Multi-Node: When to Scale Out

The most expensive mistake at the buying stage is splitting a GPU budget across two nodes when one bigger node would have done the job. The second-most expensive is staying on one node when the workload genuinely needs fabric, then spending six months pretending the box is keeping up.

This article is the decision logic for that split: when a single 8-GPU box is the right answer, when it is not, and how to tell which side of the line your workload sits on. Companion articles cover the mechanics (K02 training, K03 inference, K07 PCIe limits, K06 failure handling); this one is the buyer's call.

The 8-GPU ceiling, by model

The first question is whether the model fits in one node. With 8× RTX Pro 6000 Blackwell (96 GB each) you get 768 GB of total VRAM; with 8× RTX 5090 (32 GB each) you get 256 GB. Neither is small by 2026 standards, but neither holds everything.

Model	Weights (FP8)	Weights (INT4)	8× 5090 (256 GB)?	8× Pro 6000 (768 GB)?
Llama 3.1 / 3.3 70B	~75 GB	~40 GB	Yes, comfortably	Yes, with KV headroom
Qwen 2.5 72B (incl. VL)	~80 GB	~44 GB	Yes	Yes
Mixtral 8x22B (141B total)	~140 GB	~75 GB	INT4 only, tight	Yes
Llama 3.1 405B	~400 GB	~210 GB	No	INT4 yes, FP8 marginal
DeepSeek-V3 (671B MoE, 37B act)	~670 GB	~340 GB	No	INT4 yes, FP8 marginal
Hypothetical 600B+ dense	600+ GB	300+ GB	No	Marginal or no

The cliff is at the 405B / 671B line. Below it, one 8-GPU Pro 6000 box is enough. At and above, you are either quantizing aggressively (INT4 weights: fine for inference, miserable for training) or crossing a node boundary.

"Fits" is not the same as "runs well." A model that occupies 95% of VRAM with no headroom for KV cache, prefix cache, CUDA graphs, or activation memory will preempt requests under any real load. The working rule: weights at no more than 60–70% of VRAM, leaving at least 30–40% for everything else. Even where the weights clear that rule, as 405B at FP8 does on paper (roughly 400 of 768 GB), the KV cache needed for useful concurrency can consume the remainder, so 405B at FP8 does not comfortably fit on 8× Pro 6000 for inference. It fits the weights, not the workload.

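The working rule reduces to one division. A minimal sketch, using weight sizes from the table and the upper end (70%) of the rule as the cutoff; the function names and the three verdict labels are illustrative, not Kentino tooling. It checks weights only and does not model KV cache, which is why a model can clear the rule here and still be listed as marginal in the table.

```python
# Weight sizes in GB, copied from the table above.
WEIGHTS_GB = {
    ("Llama 3.1 70B", "FP8"): 75,
    ("Llama 3.1 405B", "FP8"): 400,
    ("Llama 3.1 405B", "INT4"): 210,
    ("DeepSeek-V3", "FP8"): 670,
    ("DeepSeek-V3", "INT4"): 340,
}
NODE_VRAM_GB = 8 * 96        # 8x RTX Pro 6000 Blackwell
MAX_WEIGHT_FRACTION = 0.70   # upper end of the 60-70% working rule


def weight_fraction(weights_gb, vram_gb=NODE_VRAM_GB):
    """Share of node VRAM taken by weights alone."""
    return weights_gb / vram_gb


def verdict(weights_gb, vram_gb=NODE_VRAM_GB, limit=MAX_WEIGHT_FRACTION):
    """Classify a model against the weights-only rule (KV cache not modeled)."""
    frac = weight_fraction(weights_gb, vram_gb)
    if frac > 1.0:
        return "does not fit"
    if frac > limit:
        return "fits weights only"
    return "passes weights rule"


for (model, quant), gb in WEIGHTS_GB.items():
    print(f"{model} {quant}: {weight_fraction(gb):.0%} of VRAM, {verdict(gb)}")
```

On these inputs DeepSeek-V3 at FP8 lands near 87% of an 8× Pro 6000 node on weights alone, while Llama 3.1 405B at FP8 lands near 52%. The second number is why the KV-cache calculation, not the weights check, decides that case.
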
When you should NOT scale out

Cases where staying single-node is unambiguously correct:

Inference for any model that fits. If model plus KV fits at target concurrency, multi-node TP over Ethernet or IB is strictly slower than single-node. PCIe Gen5 inside a box delivers ~50 GB/s between GPUs on the same switch; 200 Gbps IB between nodes delivers ~25 GB/s. A workload that limps on PCIe crawls on IB.
Single-tenant production serving. One model, one client, moderate concurrency. An 8-GPU Pro 6000 handles 70B with 32–64 concurrent requests easily. A second box is useful only as a hot spare or DP throughput doubling — neither is "scale out" in the tightly-coupled sense.
Research labs running 7B–72B models. Most academic and applied work in 2026 sits here — Llama 3.x 8B, Qwen 7B/14B/32B, Mistral, Gemma, the 70B fine-tune tail. None need more than one node.
LoRA / QLoRA fine-tuning. The point of PEFT is that you do not need full-model training resources. 70B LoRA fits on 4–8 GPUs in one node; 405B QLoRA fits on 8× Pro 6000.
Batch inference and offline workloads. If the SLA is "process this corpus by Friday," throughput-mode batching on one 8-GPU box handles it. Multi-node only helps when you cannot finish in time — usually because the model is too big, not because one node is too slow.

Roughly 80% of Kentino customers should buy one bigger node instead of two smaller ones, and most of the remaining 20% actually want DP replicas behind a load balancer, not a cluster.

When you MUST scale out

The cases where one node genuinely is not enough are narrower than people assume.

Training a 70B+ model from scratch. Eight GPUs cannot finish in any reasonable wall-clock time. A 70B pretrain at published token budgets (1.5–15T) takes hundreds to thousands of GPU-months on H100-class hardware, more on PCIe consumer GPUs. This work needs 32–128+ GPUs and an SXM fabric. Kentino does not build this tier.

Full-rank fine-tuning of 70B+. Not LoRA — full fine-tuning with optimizer states, gradients, and activations resident. A 70B full fine-tune (FP16 weights + FP32 Adam + grad + activation) is 1.2–1.5 TB of state, past one 8-GPU node even with FSDP. Justifies a 2–4 node IB cluster.
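
The 1.2–1.5 TB figure can be reproduced from per-parameter byte counts. A rough sketch, assuming the common mixed-precision Adam accounting of 12 bytes per parameter for optimizer state (FP32 master weights plus two FP32 moments); that breakdown and the function name are assumptions made here, and activations are left out because they depend on batch size and sequence length.

```python
NODE_VRAM_GB = 8 * 96  # 8x RTX Pro 6000 Blackwell


def full_finetune_state_gb(params_billion, weight_bytes=2, grad_bytes=2,
                           optimizer_bytes=12):
    """Resident training state in decimal GB, excluding activations.

    Defaults assume FP16 weights and gradients (2 bytes each) and 12 bytes of
    FP32 Adam state per parameter (master copy + first and second moments).
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


for size in (13, 70):
    state = full_finetune_state_gb(size)
    print(f"{size}B: {state} GB of state before activations, "
          f"{state / NODE_VRAM_GB:.2f}x one node's VRAM")
```

Under these assumptions a 70B model carries 1,120 GB of state before activations, which is where the 1.2–1.5 TB range begins once activation memory is added, while a 13B model stays well inside one node.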

Hosting 405B+ at production latency. Weights fit at INT4 on 8× Pro 6000, but KV cache plus concurrent serving at usable latency pushes you to two or more nodes. Two 8-GPU Pro 6000 boxes in TP=8 × PP=2 or TP=4 × PP=4 is the realistic minimum for Llama 3.1 405B at decent QPS. K03 unpacks this.

Multi-tenant production at >100k QPS aggregate. One 8-GPU node serves 500–2,000 tok/s aggregate at 70B FP8. Past tens of thousands of QPS you want multiple replicas, and past that you want a real cluster with a router and prefix-cache-aware routing. The right answer is usually many DP replicas, not one giant TP cluster.

Outside these four, the case weakens fast. Most "I need multi-node" turns out to be "I want more throughput" — a replica question, not a fabric question.

The single-node sweet spot

The geometry of a strong single-node build, on hardware Kentino actually ships:

Component	Choice	Why
GPU	8× RTX Pro 6000 Blackwell (96 GB)	768 GB VRAM holds every realistic 2026 open model
GPU (alternate)	8× RTX 5090 (32 GB)	Cheaper, 256 GB total, fine up to 72B class
CPU	EPYC 9554P or 9654 (single socket)	128 PCIe Gen5 lanes, no xGMI bottleneck
Interconnect	PCIe Gen5 x16 (switched fabric)	~50 GB/s GPU-to-GPU, no NVLink on these SKUs
RAM	768 GB–1 TB DDR5	Generous for dataset feeders and KV spill
Networking	2× 100 GbE (optional 400 GbE)	Plenty for inference egress and storage
Storage	4–8 U.2 NVMe + 2 M.2 boot	Local NVMe for datasets and checkpoint scratch

The key constraint: NVLink is not on these cards. RTX 5090, RTX Pro 6000 Blackwell, L40, L4 are PCIe-attached. NVLink-fabric SXM modules (H100 SXM, B200 SXM, GB200) require HGX baseboards we do not build. K07 covers the cost; N03 covers when NVLink matters.

The PCIe path is right for inference-first work and most training short of the frontier. Throughput-mode batching amortizes all-reduce cost for inference. For fine-tuning fixed-size models the wall-clock penalty versus SXM is 1.2–1.4× — usually acceptable. For tensor-parallel training of 70B+ from scratch the penalty is 2–3×, and the answer is "buy SXM or do not do this work on Kentino hardware."

The cliff between single-node and multi-node

The thing that makes multi-node a different category of system is the interconnect cliff between intra-node and inter-node, in bandwidth and latency.

Path	Bandwidth	Latency
GPU-to-GPU, same PEX switch (PCIe Gen5 x16)	~50 GB/s	sub-microsecond
GPU-to-GPU, cross-switch via root complex	~50 GB/s shared	low microseconds
400 Gbps InfiniBand NDR (inter-node)	~50 GB/s	1–2 microseconds
200 Gbps InfiniBand HDR (inter-node)	~25 GB/s	1–2 microseconds
100 GbE RoCE (inter-node)	~12.5 GB/s	5–15 microseconds
25 GbE TCP (inter-node)	~3 GB/s	20–50 microseconds

Inside a box, two GPUs talk at ~50 GB/s with sub-microsecond hops. Cross-node, you get ~25 GB/s on 200 Gbps IB: a 2× bandwidth penalty on 200 Gbps IB, 4× on 100 GbE, and roughly 16× on 25 GbE. For TP collectives that fire every transformer layer, that hurts badly. K07 has the all-reduce timing table.

Latency compounds it: inter-node is 5–15 microseconds on tuned RoCE versus sub-microsecond inside a box. For training and prefill this rounds away; for low-latency interactive inference with tight TP it does not.

The cliff is why "just add another box" is not a continuous decision. Anything barely surviving on PCIe inside one node will not survive on Ethernet or IB between nodes.
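
To put rough numbers on the cliff, the standard latency-bandwidth cost model for a ring all-reduce can be applied to rows of the table. This is a back-of-envelope sketch, not a measurement: the 64 MB message size and the single latency value picked for each path are illustrative assumptions, and the model treats all links as identical, which a real mixed intra-node and inter-node topology is not.

```python
# (bandwidth in GB/s, per-hop latency in microseconds).
# Bandwidths are from the table; latencies are single values picked from
# within the table's ranges for illustration.
PATHS = {
    "PCIe Gen5, same switch": (50.0, 0.5),
    "200 Gbps IB HDR": (25.0, 2.0),
    "100 GbE RoCE": (12.5, 10.0),
    "25 GbE TCP": (3.0, 35.0),
}


def ring_allreduce_seconds(message_bytes, n_gpus, bandwidth_gb_s, latency_us):
    """Ring all-reduce estimate: 2(N-1) steps, each moving 1/N of the message."""
    steps = 2 * (n_gpus - 1)
    latency_term = steps * latency_us * 1e-6
    bandwidth_term = steps * (message_bytes / n_gpus) / (bandwidth_gb_s * 1e9)
    return latency_term + bandwidth_term


MESSAGE_BYTES = 64e6  # hypothetical 64 MB all-reduce payload
for name, (bw, lat) in PATHS.items():
    t = ring_allreduce_seconds(MESSAGE_BYTES, 8, bw, lat)
    print(f"{name}: {t * 1e3:.2f} ms per all-reduce across 8 GPUs")
```

For large messages the bandwidth term dominates and the times scale with the bandwidth ratios; for small messages the latency term takes over, which is the case that matters for interactive inference with tight TP.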

Strong-scaling math: where it falls over

Amdahl's law: speedup is bounded by the serial fraction of the workload, and for distributed training that fraction is communication overhead. For a 70B-class training step on Kentino-class PCIe hardware, scaling efficiency (per-GPU throughput vs single-GPU baseline) looks like this across builds we have shipped:

Configuration	Per-GPU efficiency	Useful regime
1 GPU	1.00× (baseline)	Always
4 GPU, single node, PCIe Gen5 TP	0.82×	Sweet spot for TP
8 GPU, single node, PCIe Gen5 TP (switched)	0.73×	Edge of useful for TP
8 GPU, single node, FSDP / data parallel	0.88×	Strong for DP
2 nodes × 4 GPU, 200 Gbps IB, cross-node TP	0.65×	Painful, rarely worth it
2 nodes × 8 GPU, 200 Gbps IB, TP intra / PP inter	0.74×	Reasonable for big models
4 nodes × 8 GPU, 400 Gbps IB NDR, mixed TP/PP/DP	0.62×	Real cluster work
2 nodes × 8 GPU, 100 GbE RoCE, data parallel only	0.84×	Best multi-node trade for DP

Two takeaways. First, splitting an 8-GPU job into two 4-GPU nodes is worse than running it in one box: none of the cross-node fabrics listed earlier beats the PCIe inside the box you already had on bandwidth, and every one adds latency. Second, data parallelism scales much better than tensor parallelism across fabric. If your real question is "can I serve more requests" rather than "can I run one bigger model faster," DP replicas work, and they work over commodity 100 GbE.

If projected efficiency drops under 60%, the workload is wrong-shaped for multi-node on commodity fabric. Re-architect (TP inside a node, PP or DP across), buy a bigger single node, or buy SXM-class hardware. Brute force does not work.
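
The 60% floor is simple to apply: multiply GPU count by per-GPU efficiency to get an effective speedup, and flag anything under the floor. A small sketch using efficiencies from the table above; the function names are illustrative.

```python
# (label, total GPUs, per-GPU efficiency) taken from the table above.
CONFIGS = [
    ("8 GPU, single node, PCIe Gen5 TP", 8, 0.73),
    ("2 nodes x 4 GPU, 200 Gbps IB, cross-node TP", 8, 0.65),
    ("2 nodes x 8 GPU, 200 Gbps IB, TP intra / PP inter", 16, 0.74),
    ("4 nodes x 8 GPU, 400 Gbps IB NDR, mixed TP/PP/DP", 32, 0.62),
    ("2 nodes x 8 GPU, 100 GbE RoCE, data parallel only", 16, 0.84),
]
EFFICIENCY_FLOOR = 0.60


def effective_speedup(n_gpus, per_gpu_efficiency):
    """Throughput in single-GPU equivalents."""
    return n_gpus * per_gpu_efficiency


def above_floor(per_gpu_efficiency, floor=EFFICIENCY_FLOOR):
    """False means the workload is wrong-shaped for this configuration."""
    return per_gpu_efficiency >= floor


for label, n, eff in CONFIGS:
    status = "ok" if above_floor(eff) else "re-architect"
    print(f"{label}: {effective_speedup(n, eff):.2f} GPU-equivalents ({status})")
```

Eight GPUs in one node at 0.73× deliver about 5.8 GPU-equivalents; the same eight split across two nodes at 0.65× deliver 5.2.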

The research-lab trap and the operational tax

A pattern we see often enough to call out: a lab plans for "the future" and orders two 4-GPU nodes instead of one 8-GPU node. What they actually get is worse training (0.65× cross-node TP vs 0.73× intra-node), worse inference for any model that fit in one box, twice the operational burden (two BMCs, two NIC tunings, two driver pin states, two failure domains), and roughly the same parts cost after the second NIC, second PSU, and switch upgrade. Buy one 8-GPU node first.

Multi-node, when it is the right answer, is not free. The extra tax:

Shared storage — local NVMe stops being enough. NFS, BeeGFS, or Lustre, plus a storage VLAN (K04).
Async sharded checkpoints — synchronous unsharded writes to NFS stall the cluster. PyTorch DCP or NeMo is required, not optional.
NIC and NCCL tuning — RoCE flow control, PFC, ECN, jumbo frames, NCCL transport choice, topology files, ring vs tree algorithms. Every knob will be wrong out of the box.
Monitoring — DCGM per node, Prometheus federation, NCCL trace buffers.
Failure handling — node disconnects, NIC resets, switch port flaps. K06 covers the modes; multi-node failure rates are roughly N times single-node, recovery is messier.

In engineering time, multi-node costs 4–5× per added node. Plan for it, or accept the cluster spending its first six months at half theoretical capacity.

The concrete decision flow

Walk through this in order. The first "yes" ends the conversation.

Does the model fit in 8× 96 GB at FP8 with 30–40% VRAM headroom for KV? If yes, one node, done.
Does it fit at INT4 with the same headroom? If yes and you are doing inference (not training), one node at INT4 is the answer. INT4 weights are not viable for the gradient path of training — continue.
Is the workload throughput-bound rather than model-size-bound? If yes, the answer is data-parallel replicas of one-node configurations, not a cluster. Two boxes behind a load balancer, no fabric needed.
Is the workload tensor-parallel training of a model that does not fit in one node? Multi-node with InfiniBand. Project scaling efficiency using K07's table. Below 60%, re-architect (TP inside, PP across) or reduce node count and accept slower wall-time.
Is the workload pretraining a 70B+ model from scratch? Frontier case. Multi-node with NDR IB or SXM. Kentino can build the IB side, but most customers asking the question do not actually need to do this work themselves.

Steps one and two are the bulk of the market. Step three means you are growing well — the answer is replicas, not a cluster. Steps four and five are real but rare.
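
The flow above can be written down as a first-match function. This is only a sketch of the ordering: the `Workload` fields and the returned strings are hypothetical names chosen here, each flag stands in for a judgment you still have to make by hand, and the final fall-through case is not covered by the text.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    fits_fp8_with_headroom: bool   # step 1
    fits_int4_with_headroom: bool  # step 2
    is_training: bool              # INT4 weights rule out the gradient path
    throughput_bound: bool         # step 3
    oversized_tp_training: bool    # step 4
    pretrain_70b_plus: bool        # step 5


def recommend(w: Workload) -> str:
    """Return the first matching answer, in the order the steps are listed."""
    if w.fits_fp8_with_headroom:
        return "one node"
    if w.fits_int4_with_headroom and not w.is_training:
        return "one node at INT4"
    if w.throughput_bound:
        return "data-parallel replicas of one-node builds"
    if w.oversized_tp_training:
        return "multi-node with InfiniBand, after projecting scaling efficiency"
    if w.pretrain_70b_plus:
        return "multi-node with NDR InfiniBand or SXM"
    # Not covered by the text: no step matched.
    return "no step matched; revisit the workload description"


# Example: inference for a model that only fits at INT4.
serving = Workload(False, True, False, False, False, False)
print(recommend(serving))
```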

Honest take

Multi-node is right at the frontier — 70B+ training from scratch, 405B+ inference at production latency, hyperscale serving above 100k QPS, or research that depends on per-day throughput one box cannot deliver. These are real workloads. They are not most of what gets built.

For everything else, one well-spec'd 8-GPU node is the answer. It runs every 2026 open-weight inference workload that fits in 768 GB at FP8/INT4, LoRA and QLoRA up to 405B, full fine-tunes of 13B-class without complaint, and scales to two or three DP replicas for throughput with no cluster fabric. And it is dramatically simpler to operate.

The shape of the conversation we have with most customers: describe the workload, do the fit math, project scaling efficiency. If the projection is a cluster, build a cluster. If it is one node, build one node. If it is two replicas behind a router, build that. We are not selling the largest configuration you will tolerate — we are selling the one that will actually work.

What to do next

If you are weighing scale-out before signing:

Write down the model and the workload. Parameter count, quantization, peak concurrent users, target latency, target throughput. The fit and bandwidth math falls out of these numbers; without them the answer is a guess.
Compute weights plus KV cache at target concurrency. Fits in 8× 96 GB with 30% headroom → single-node. Otherwise evaluate multi-node.
Project scaling efficiency for your real configuration. Use K07's table. Under 60% means the architecture is wrong, not the node count.
Separate the throughput question from the model-fit question. "More requests per second" is a replica question. Two 8-GPU boxes behind a router beat one 16-GPU cluster for every latency-sensitive workload we have measured.
Assess operational capacity honestly. Without a storage engineer, a network engineer, and on-call, the second node spends its first quarter at 50% theoretical capacity while you debug NCCL and BeeGFS.
Default to one bigger node, not two smaller. 4-GPU × 2 versus 8-GPU × 1 goes to the 8-GPU box on nearly every dimension.
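
The second item in the list, weights plus KV cache at target concurrency, can be sketched as follows. The layer count, KV-head count, and head dimension are our reading of the published Llama 3.1 70B configuration, and the 8,192 tokens per request is a hypothetical workload figure; substitute your own model's numbers. The sketch assumes an FP16 KV cache (2 bytes per element) and ignores prefix-cache sharing and allocator overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(concurrent_requests, tokens_per_request, **shape):
    """Total KV cache in decimal GB at the target concurrency."""
    total_tokens = concurrent_requests * tokens_per_request
    return total_tokens * kv_bytes_per_token(**shape) / 1e9


def fits_single_node(weights_gb, kv_gb, vram_gb=8 * 96, headroom=0.30):
    """True if weights + KV stay within VRAM minus the headroom fraction."""
    return weights_gb + kv_gb <= vram_gb * (1 - headroom)


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads (GQA), head dim 128.
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

kv = kv_cache_gb(concurrent_requests=64, tokens_per_request=8192, **LLAMA_70B)
weights = 75  # GB at FP8, from the model table
print(f"KV cache: {kv:.1f} GB, weights + KV: {weights + kv:.1f} GB")
print("single node" if fits_single_node(weights, kv) else "evaluate multi-node")
```

With these assumptions the KV cache alone is about 172 GB, and weights plus KV come to roughly 247 GB against a budget of about 538 GB (70% of 768 GB), so this workload stays on one node.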

Companion articles: K02 (training), K03 (inference clusters), K04 (storage), K06 (failure handling), K07 (PCIe limits and the scaling wall), N02 (IB vs RoCE vs Ethernet), N03 (NVLink and when it matters).

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
