Inference Clusters: vLLM Tensor Parallel, Pipeline Parallel, and What Each One Actually Costs You
A 70B-class model does not fit on one GPU at any quantization that leaves useful KV cache. A 405B model does not fit on one node. Once that is true, the question is no longer "which GPU" but "how do I cut the model across the GPUs I have, and what does each cut cost me?"
This article covers the four ways vLLM lets you slice a model — tensor parallel, pipeline parallel, expert parallel, data parallel — what each one does to your bandwidth bill, and how to pick between them on the Kentino lineup (PCIe-attached 5090, RTX Pro 6000 Blackwell, L40, L4 — no NVLink-fabric SXM parts). The audience has read I01, knows what vLLM is, and now needs to make config decisions.
The four dimensions of cutting a model
Every distributed-inference framework — vLLM, SGLang, TensorRT-LLM, Triton — exposes some combination of the same four axes. They are not alternatives; they compose.
| Axis | What it splits | Communication per token | Bandwidth-sensitive? | Latency impact |
|---|---|---|---|---|
| Tensor parallel | Each layer (matmul shards) | All-reduce per layer (×2) | Yes — heavy | Reduces latency |
| Pipeline parallel | Layers across stages | Activations per stage boundary | No — light | Adds latency, raises throughput |
| Expert parallel | MoE experts across GPUs | All-to-all per MoE layer | Yes — bursty | Model-dependent |
| Data parallel | Whole replicas, independent | None during inference | No | Same latency, N× throughput |
Column three is the entire game. TP shouts across the bus on every layer. PP whispers between two stages. That single fact decides whether you keep TP inside one box, push PP across nodes, and where the line falls.
Tensor parallel — how it actually works
In TP, every weight matrix in the transformer (attention QKV, attention output, FFN up, FFN down) is sliced across tensor_parallel_size GPUs. Each GPU stores one shard of every layer and computes one shard of every activation. Because attention and FFN contain matmuls that need the full activation reassembled before the next op (softmax, SwiGLU), partial results must be combined. vLLM does this with an all-reduce at the end of attention and another at the end of the FFN — two per transformer block.
Llama 3.3 70B has 80 layers, hidden 8192. At batch 32 decode, each all-reduce moves ~512 KB, ×160 per generated token → ~80 MB per token across the bus. Prefill is dramatically worse: a 4 K prefill at batch 32 pushes on the order of 300 GB through the all-reduce ring in one forward pass.
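The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that counts only the activation payload handed to each all-reduce (batch × tokens × hidden × bytes per element), assuming FP16 activations and two all-reduces per layer as described above. The function name is illustrative, and real on-wire traffic also depends on the collective algorithm and the TP degree.

```python
def tp_allreduce_payload_bytes(batch, tokens_per_seq, hidden, layers,
                               bytes_per_elem=2, allreduces_per_layer=2):
    """Return (bytes per all-reduce, bytes per forward pass)."""
    per_allreduce = batch * tokens_per_seq * hidden * bytes_per_elem
    per_pass = per_allreduce * layers * allreduces_per_layer
    return per_allreduce, per_pass


# Llama 3.3 70B: 80 layers, hidden size 8192.
decode_one, decode_pass = tp_allreduce_payload_bytes(32, 1, 8192, 80)
prefill_one, prefill_pass = tp_allreduce_payload_bytes(32, 4096, 8192, 80)

print(f"decode:  {decode_one / 2**10:.0f} KiB per all-reduce, "
      f"{decode_pass / 2**20:.0f} MiB per step")
print(f"prefill: {prefill_one / 2**30:.0f} GiB per all-reduce, "
      f"{prefill_pass / 2**30:.0f} GiB per forward pass")
```

The decode case gives 512 KiB per all-reduce and 80 MiB per step; the 4 K prefill gives 2 GiB per all-reduce and 320 GiB per forward pass, which is the "order of 300 GB" figure.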
That number is why TP loves NVLink. SXM H100 NVLink is 900 GB/s per GPU, and B200 doubles that to 1.8 TB/s. PCIe Gen 5 x16 is 64 GB/s unidirectional, 128 GB/s bidirectional in the best case, which is rarely the case on a 4-GPU board (lanes are usually shared, see W02). The 14×–28× gap against a 64 GB/s PCIe link shows up in benchmarks: NVLink scaling efficiency lands ~0.92×/card, PCIe ~0.70–0.78×/card on 70B-class models.
Practical consequence: TP scales well to 4 GPUs in one PCIe Gen 5 node. Beyond that, all-reduce costs more than the parallelism saves and you should reach for pipeline parallel instead.
vLLM config: tensor_parallel_size
--tensor-parallel-size N tells the engine to shard every weight tensor across N GPUs in the local node. Constraints:
- N must divide the model's attention head count (Llama 70B has 64 heads → N ∈ {1, 2, 4, 8, 16, 32, 64}).
- vLLM places TP ranks on the same node and assumes a fast intra-node bus.
- KV cache is sharded along the head dimension: each GPU stores total_heads / N heads' worth. Higher TP gives more KV headroom per request.
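The divisibility constraint is easy to check before ordering hardware. A minimal sketch, assuming the only constraint is the one stated above (head count divisible by the TP degree); the helper name is made up for illustration.

```python
def valid_tp_sizes(num_heads, max_gpus):
    """TP degrees up to max_gpus that divide the attention head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]


# Llama 70B has 64 attention heads.
print(valid_tp_sizes(64, 8))         # candidates for an 8-GPU box
print(5 in valid_tp_sizes(64, 8))    # a 5-GPU box cannot use TP=5
```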
On Kentino hardware: TP=4 on a 4× RTX 5090 or 4× RTX Pro 6000 Blackwell box is the sweet spot. TP=8 works but the PCIe bus groans; you are usually better off with TP=4 × PP=2 inside an 8-GPU box.
Pipeline parallel — the across-the-room option
PP splits the model by depth. With pipeline_parallel_size=2, GPU 0 holds layers 0–39 of a 70B Llama, GPU 1 holds 40–79. A request flows through GPU 0, the activation tensor ships to GPU 1, GPU 1 finishes the forward pass.
Communication is one tensor of shape (batch, seq_len, hidden_size) per stage boundary. For batch 32, seq 4096, hidden 8192, FP16, that is ~2 GB per prefill (32 × 4096 × 8192 × 2 bytes) and ~0.5 MB per decode step at batch 32, two orders of magnitude less than the TP all-reduce traffic. PP runs comfortably across plain 25 GbE or even 10 GbE.
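To size the cross-node link, the boundary tensor can be computed from its shape and divided by the link rate. This sketch assumes FP16 activations and an ideal link with no protocol overhead, so real transfers will be somewhat slower; both function names are illustrative.

```python
def pp_boundary_bytes(batch, tokens_per_seq, hidden, bytes_per_elem=2):
    """Size of the activation tensor crossing one pipeline-stage boundary."""
    return batch * tokens_per_seq * hidden * bytes_per_elem


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal wire time on an Ethernet link, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


prefill = pp_boundary_bytes(32, 4096, 8192)   # one 4 K prefill at batch 32
decode = pp_boundary_bytes(32, 1, 8192)       # one decode step at batch 32

for link in (10, 25, 100):
    print(f"{link:>3} GbE: prefill {transfer_seconds(prefill, link):.2f} s, "
          f"decode {transfer_seconds(decode, link) * 1e3:.2f} ms")
```

The prefill tensor is 2 GiB and the decode tensor 512 KiB. On 10 GbE the decode hop costs well under a millisecond, while a full batch-32 prefill takes on the order of seconds, which is why the faster link mostly helps prefill.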
The trade-off is latency. With PP=2, every token passes through both stages in sequence, so a lone request keeps only one GPU busy at a time and pays an extra network hop at each stage boundary. vLLM mitigates this with micro-batching: stage 0 starts the next micro-batch while stage 1 finishes the current one. With enough concurrency the bubble closes; with one request and no batching, PP is a latency tax for no gain.
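How quickly the bubble closes can be estimated with the textbook pipeline model. This is an idealized sketch, assuming equal-cost stages and free transfers; it is not a description of vLLM's scheduler, and the function name is illustrative.

```python
def pipeline_utilization(stages, micro_batches):
    """Idealized fraction of time each stage is busy.

    Assumes equal-cost stages and free transfers: the first micro-batch needs
    `stages` time slots to drain, and each further one adds a single slot.
    """
    total_slots = micro_batches + stages - 1
    return micro_batches / total_slots


for m in (1, 2, 8, 32):
    print(f"PP=2, {m:>2} micro-batches in flight: "
          f"{pipeline_utilization(2, m):.0%} busy")
```

With a single micro-batch each stage is idle half the time; with eight in flight the idle share drops to about a ninth.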
vLLM config: pipeline_parallel_size
--pipeline-parallel-size M splits the model by layers across M groups. Total GPUs = tensor_parallel_size × pipeline_parallel_size. Docs guidance:
- Single node, ≤ 8 GPUs, model fits: pure TP, pipeline_parallel_size=1.
- Multi-node: TP within the node, PP across nodes. A two-node 8-GPU cluster runs TP=4, PP=2.
- GPU count does not divide head count cleanly: set TP=1, PP=GPUs. PP doesn't care about head count. (A 5-GPU box — one slot lost to a NIC — can run Llama only via PP=5.)
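The depth-wise split can be pictured as contiguous layer ranges per stage. The sketch below is one plausible even split, written for illustration; vLLM's own partitioner may place uneven remainders differently.

```python
def split_layers(num_layers, pp_size):
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible.

    Returns a list of (first_layer, last_layer) tuples, inclusive.
    """
    base, extra = divmod(num_layers, pp_size)
    ranges, start = [], 0
    for stage in range(pp_size):
        count = base + (1 if stage < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges


print(split_layers(80, 2))   # 70B Llama on two stages
print(split_layers(80, 5))   # the 5-GPU box from the last bullet
print("GPUs needed for TP=4, PP=2:", 4 * 2)
```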
Expert parallel — for MoE only
MoE models (DeepSeek-V3, Mixtral, Qwen-MoE flavors) do not activate every parameter on every token. They have routed FFN layers where only a small subset of "experts" fires per token; dense attention layers stay dense.
Expert parallel (EP) shards experts across GPUs while keeping dense layers under TP or DP. With --enable-expert-parallel, expert layers switch from replicated to partitioned, one or a few experts per GPU. The communication pattern is all-to-all per MoE layer: tokens route to whichever GPU owns the target expert, compute, return.
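A toy model of the dispatch step shows why the traffic is uneven. This sketch assumes contiguous expert placement, an expert count divisible by the GPU count, and one expert per token for simplicity (real MoE layers route each token to several experts); none of the names correspond to vLLM internals.

```python
from collections import Counter


def expert_owner(expert_id, num_experts, num_gpus):
    """Map an expert to a GPU using contiguous, equal-sized blocks."""
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu


def dispatch_counts(token_expert_ids, num_experts, num_gpus):
    """Count how many tokens each GPU receives in the all-to-all dispatch."""
    return Counter(expert_owner(e, num_experts, num_gpus)
                   for e in token_expert_ids)


# Hypothetical routing decisions for 8 tokens over 8 experts on 4 GPUs.
routed = [0, 1, 1, 5, 5, 5, 7, 2]
print(dispatch_counts(routed, num_experts=8, num_gpus=4))
```

GPUs 0 and 2 each receive three tokens while GPUs 1 and 3 receive one, so the per-link load depends on what the router decides for each batch.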
EP is bandwidth-bursty, but it is what makes large MoE models tractable on PCIe clusters at all: full TP on a 671B-parameter model (DeepSeek-V3 activates about 37B parameters per token) is hopeless. For Kentino deployments EP is relevant only for DeepSeek-V3-class models; dense Llama 70B does not benefit. vLLM's built-in EP on a recent build is the default entry point.
Data parallel — the boring, brilliant axis
DP is the easiest scaling axis and the one most installations under-use. You spin up N identical copies of the model, each on its own set of GPUs (each set may itself use TP and/or PP). A load balancer sprays requests to whichever replica has capacity.
What DP gives:
- Linear throughput scaling (N× requests/sec).
- Zero inter-replica communication during inference.
- Independent KV caches per replica (prefix cache is per-replica).
- Trivial failure isolation.
What DP costs:
- N× the GPU memory — each replica holds the full model.
- No latency reduction. A single request takes whatever it takes on one replica.
If you have a 4× RTX Pro 6000 box and Llama 70B fits in TP=4, a second 4× box gives DP=2 × TP=4 — doubled throughput, same per-request latency. For chat, agent, and RAG workloads, that is the right trade. vLLM's --data-parallel-size flag (and the newer data_parallel_deployment mode) launches and manages replicas. DP is the cleanest way to scale past one box.
Combining the axes — the rule of thumb
Axis selection: start with the smallest parallelism that fits, then add DP before PP.
Worked example. Serving Llama 3.3 70B (FP8 ≈ 75 GB weights, plus KV) at high concurrency:
- A 4× RTX Pro 6000 Blackwell box (4 × 96 GB = 384 GB) runs it comfortably under TP=4, with ~250 GB left for KV, prefix cache, and CUDA graphs.
- Add a second 4× Pro 6000 box. DP=2 × TP=4. Two replicas behind a router. Doubled throughput, same latency.
- Llama 3.1 405B at FP8 (~400 GB weights)? One 4× Pro 6000 box does not fit. Two boxes via PP=2 × TP=4 do — and the cross-node link is moving activations, not all-reduce. 25 GbE is enough; 100 GbE is comfortable.
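The worked example can be turned into a small planning helper. This is a sketch under simplifying assumptions: TP always spans one full node, PP is used only when the weights exceed one node, spare nodes become DP replicas, and only weights are checked against usable VRAM (KV cache needs its own headroom). The function and its 0.90 memory-utilization default are illustrative.

```python
import math


def plan_parallelism(weights_gb, gpu_gb, gpus_per_node, nodes, mem_util=0.90):
    """Pick TP/PP/DP: TP inside a node, PP only if needed, DP with what is left."""
    usable_per_node = gpu_gb * gpus_per_node * mem_util
    pp = math.ceil(weights_gb / usable_per_node)   # nodes one replica needs
    if pp > nodes:
        raise ValueError("model does not fit on this cluster")
    return {"tp": gpus_per_node, "pp": pp, "dp": nodes // pp}


print(plan_parallelism(75, 96, 4, nodes=1))    # Llama 3.3 70B FP8, one box
print(plan_parallelism(75, 96, 4, nodes=2))    # same model, two boxes
print(plan_parallelism(400, 96, 4, nodes=2))   # Llama 3.1 405B FP8, two boxes
```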
KV cache: the part everyone underestimates
The KV cache is the cumulative attention key/value tensors for every prompt token, every generated token, every concurrent request, every layer. It grows linearly with context length and concurrency. Llama 70B at 8 K context needs roughly 2.5 GB of KV cache per request in FP16. At 32 concurrent requests, that is 80 GB — more than a 5090's entire VRAM.
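The per-request figure follows directly from the model shape. A minimal sketch, assuming the Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and FP16 cache entries; the function name is illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Llama 3 70B shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_request = per_token * 8192            # one request at 8 K context
per_32 = per_request * 32                 # 32 concurrent requests

print(f"{per_token / 2**10:.0f} KiB per token")
print(f"{per_request / 2**30:.1f} GiB per 8K request")
print(f"{per_32 / 2**30:.0f} GiB at 32 concurrent requests")
```

That is 320 KiB per token, 2.5 GiB per full 8 K request, and 80 GiB at 32 concurrent requests.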
How parallelism interacts with KV:
- Under TP, KV is sharded by attention head across the TP group. Per-GPU KV = total / tensor_parallel_size. Higher TP → more concurrent-request headroom.
- Under PP, KV stays on the GPU holding the layer that produced it. Each stage owns its own KV.
- Under DP, KV is fully independent per replica.
- Under context parallel (a newer vLLM mode), KV is sharded along the sequence dimension — useful for very long single-request contexts.
When sizing a box, do not just check whether the weights fit. Run the KV math at your target concurrency and context window. The most common silent failure in production vLLM is the engine preempting requests under KV pressure.
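One way to run that KV math is to ask how many requests fit if every one fills the whole context window. The sketch below is an upper bound, since it ignores activation buffers and CUDA graphs; the inputs (4× 32 GB GPUs, 75 GB of FP8 weights, 0.92 memory utilization, 327,680 bytes of FP16 KV per token for a Llama 70B shape) are assumptions chosen for illustration.

```python
def max_full_length_requests(total_vram_gb, mem_util, weights_gb,
                             context_tokens, kv_bytes_per_token):
    """How many requests fit if every one fills the whole context window.

    Ignores activation buffers and CUDA graphs, so treat it as an upper bound.
    """
    kv_budget_bytes = (total_vram_gb * mem_util - weights_gb) * 1e9
    per_request_bytes = context_tokens * kv_bytes_per_token
    return int(kv_budget_bytes // per_request_bytes)


# Assumed inputs: 4x 32 GB GPUs, 75 GB of FP8 weights, 8 K context, and
# 327,680 bytes of FP16 KV per token (80 layers x 8 KV heads x 128 dims x K,V).
n = max_full_length_requests(128, 0.92, 75, 8192, 327_680)
print(n)
```

Under these assumptions about 15 full-length requests fit. A higher scheduler limit is only reachable when most requests are much shorter than the window; otherwise the engine starts preempting.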
Request routing — what sits in front of the cluster
A single vLLM instance handles its own internal batching (continuous batching, prefix caching, scheduling). It does not route across replicas. That is the router's job.
| Router | Awareness | When to use it |
|---|---|---|
| Plain NGINX | None (round-robin) | One-model deployments, simplicity wins |
| HAProxy | None + health-check | Multi-model, header-routed |
| vLLM Router (Rust) | KV / prefix / queue | ≥4 replicas, prefix-aware routing matters |
| llm-d (Kubernetes) | All of above + EP | K8s fleets, MoE, prefill/decode disaggregation |
NGINX is the right default for a 2-replica install — round-robin, health-checks on /health, done. The vLLM Router (Rust, released late 2025) is the right call once prefix-cache hit rate dominates your tail latency: it routes on consistent hashing of the prompt prefix so cache-warm replicas keep getting the same conversations. For an agent workload with long shared system prompts, this can double effective throughput vs. round-robin.
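The idea behind prefix-aware routing can be illustrated with rendezvous hashing, a simple relative of consistent hashing. This is a sketch of the general technique, not the vLLM Router's implementation: it hashes the first few hundred characters rather than tokens, and the replica addresses are hypothetical.

```python
import hashlib


def pick_replica(prompt, replicas, prefix_chars=256):
    """Choose a replica by rendezvous hashing on the prompt prefix.

    Every replica gets a score from hash(replica + prefix); the highest wins.
    The same prefix always maps to the same replica, and removing a replica
    only remaps the prompts that were assigned to it.
    """
    prefix = prompt[:prefix_chars]

    def score(replica):
        digest = hashlib.sha256(f"{replica}|{prefix}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


# Hypothetical replica addresses.
replicas = ["http://10.0.0.1:8000", "http://10.0.0.2:8000", "http://10.0.0.3:8000"]
system_prompt = "You are a maintenance agent for a warehouse robot. " * 10

a = pick_replica(system_prompt + "Check battery status.", replicas)
b = pick_replica(system_prompt + "Plan a route to dock 4.", replicas)
print(a == b)
```

Both requests share a system prompt longer than the hashed prefix, so they land on the same replica and can reuse its prefix cache.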
The bandwidth math
| Workload | Bandwidth needed | Kentino-realistic link |
|---|---|---|
| TP=4 within one box (PCIe Gen 5) | 50–200 GB/s per pair | Intra-node PCIe |
| PP across two nodes, batch 32, decode | 0.05–0.2 GB/s | 10 GbE — comfortable |
| PP across two nodes, batch 32, prefill | 1–4 GB/s burst | 25 GbE comfortable, 10 GbE marginal |
| DP across two nodes | ~0 (router only) | 1 GbE management fine |
| EP across 8 GPUs in one box (MoE) | 20–80 GB/s bursty | Intra-node only |
| EP wide across 2 nodes (DeepSeek-V3 class) | 10–40 GB/s sustained | 100 GbE RoCE or InfiniBand |
The honest read: TP and EP want to stay inside a box. PP and DP do not care. With a 10–25 GbE cross-node link, PP and DP are fine. The moment you want TP across nodes, you are paying for InfiniBand HDR or 200 GbE RoCE — and you should first ask whether DP across one-node TP groups gets you the same result for a tenth the budget. For most Kentino-sized deployments, it does.
Two concrete config recipes
Recipe A — One node, 4× RTX 5090, Llama 70B Q4 / FP8
Hardware: K-AI 256 Turin Dual or any 4-GPU 5090 box. PCIe Gen 5, no NVLink, AMD EPYC host.
vllm serve meta-llama/Llama-3.3-70B-Instruct-FP8 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--max-num-seqs 64 \
--dtype auto \
--port 8000
Expected: roughly 30–40 tok/s per request at low concurrency, aggregate ~400–600 tok/s at 32 concurrent (varies with prompt mix, prefix-cache hit rate, exact quant — treat as a starting envelope). PCIe Gen 5 all-reduce is the bottleneck on decode; prefill scales near-linearly.
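Once the server is up, it can be exercised through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library; the base URL assumes the port from the recipe on the local machine, and the model name must match whatever the server was started with.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="meta-llama/Llama-3.3-70B-Instruct-FP8",
                             max_tokens=128):
    """Build a POST to the OpenAI-compatible /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a live server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Calling complete("Explain tensor parallelism in one sentence.") against the running server is a quick smoke test before putting a router in front.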
Recipe B — Two nodes, 8× RTX Pro 6000 Blackwell total, Llama 405B FP8
Two K-AI boxes, each 4× Pro 6000 (96 GB). 100 GbE RoCE link between them, or 25 GbE if budget is tight (will work, slightly slower prefill).
# Step 1, node 0 (head): start the Ray cluster.
ray start --head --port=6379
# Step 2, node 1 (worker): join the cluster.
ray start --address=<head-ip>:6379
# Step 3, node 0 (head): launch vLLM once both nodes appear in `ray status`.
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--max-num-seqs 32
TP=4 inside each node uses PCIe Gen 5 for the all-reduces. PP=2 across nodes ships activations over 25/100 GbE. With Ray as the distributed backend (vLLM's default for multi-node), the head coordinates scheduling and KV state.
Honest performance call: 405B at FP8 across 8× Pro 6000 over PCIe + Ethernet lands around 6–12 tok/s per request. That is well below an 8× SXM B200 chassis, at a fraction of the capex and without the SXM supply problem. At that rate a 500-token completion takes roughly 40–85 s, so if your SLA is "answer within 90 s for a 500-token completion," it works. If it is "answer in 2 s," do not use a 405B; use a 70B.
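Converting a tok/s figure into wall-clock time makes the SLA check concrete. A minimal sketch that ignores queueing and prefill time, so real latencies will be somewhat higher:

```python
def completion_seconds(tokens, tok_per_s):
    """Decode time for one request, ignoring queueing and prefill."""
    return tokens / tok_per_s


for rate in (6, 12):
    print(f"{rate:>2} tok/s -> {completion_seconds(500, rate):.0f} s for 500 tokens")
```

At 6–12 tok/s, a 500-token completion needs roughly 42 to 83 seconds of decode time.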
What we are not running
NVLink, NVSwitch, SXM B200, full HGX boards: not in the Kentino lineup. They are the right answer if you have the budget and the workload. They are not the right answer for most of our customers, who size for 1–4 concurrent agent workflows or a single robot platform, not 1 000-user SaaS inference. The PCIe path is honest about what it can and cannot do. Sub-10 ms per-token TP across 16 GPUs is a different conversation — not the cluster this article is about.
What to do next
If you are putting a vLLM cluster together, work through these in order:
- Write down the model and the SLA. Parameter count, quantization, target tok/s per request, target concurrent requests, target context window. Without these numbers the parallelism choice is a guess.
- Compute weights + KV at target concurrency. If KV alone exceeds one GPU's spare VRAM, you need TP. If weights exceed one node, you need PP.
- Start with the smallest TP that fits. TP=2 before TP=4 before TP=8. Each step up loses scaling efficiency on PCIe.
- Add DP for throughput before adding PP. Two nodes running as DP replicas almost always beat the same two nodes joined into one PP pipeline for latency-sensitive workloads.
- Reserve PP for the model-does-not-fit case or for spanning a node count that TP cannot cleanly divide.
- Put a router in front, even with two replicas. Round-robin NGINX is enough to start; upgrade to vLLM Router when prefix-cache hit rate matters.
- Monitor KV utilization, not just GPU utilization. A cluster at 95% GPU and 100% KV is preempting requests. The dashboard you want is vllm_kv_cache_usage_perc over time.
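For the last item, a quick check without a full dashboard is to read the gauge out of a Prometheus-format scrape. This sketch parses sample text rather than a live endpoint; the metric name is passed in because the exact name and prefix vary by vLLM version, and the sample scrape and the 0.90 alert threshold are assumptions for illustration.

```python
def read_gauge(metrics_text, name):
    """Return the first sample value for `name` from Prometheus text format."""
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue                      # skip HELP/TYPE comments and blanks
        sample, _, value = line.rpartition(" ")
        metric = sample.split("{", 1)[0]  # drop the {label="..."} part
        if metric == name:
            return float(value)
    return None


# Hypothetical scrape output; real label sets and metric prefixes vary by version.
sample = """\
# TYPE vllm_kv_cache_usage_perc gauge
vllm_kv_cache_usage_perc{model_name="llama-70b"} 0.97
"""
usage = read_gauge(sample, "vllm_kv_cache_usage_perc")
if usage is not None and usage > 0.90:
    print(f"KV cache at {usage:.0%}: expect preemptions")
```

The parser assumes lines carry no trailing timestamp; a production setup would normally let Prometheus scrape the endpoint and alert on the series instead.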
Follow-ups in this track: cluster storage (K04), scheduling (K05), failure handling (K06), and the PCIe-as-interconnect ceiling (K07). The networking math is unpacked in N02 and N06.
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.