GPU Selection for AI Workloads: 5090, 4090, RTX Pro 6000, L40, L4 Head-to-Head

There is no universally correct GPU for AI work in 2026. There is a correct GPU for a defined workload, a defined power envelope, and a defined budget — and the wrong card in the right chassis is a more expensive mistake than the right card in the wrong chassis. This article walks the Kentino lineup head-to-head, with real performance numbers, honest trade-offs, and a decision flow that we have actually used on customer calls. It does not pretend H100 and A100 do not exist; they do, we do not sell them, and we will be specific about when that gap matters.

The cards on the table:

RTX 5090 — 32 GB GDDR7, 1.79 TB/s, 575 W, consumer.
RTX 4090 — 24 GB GDDR6X, 1.01 TB/s, 450 W, consumer, previous generation.
RTX Pro 6000 Blackwell Server Edition — 96 GB GDDR7 ECC, 1.79 TB/s, 600 W, passive cooling, server form factor, no display outputs.
RTX Pro 6000 Blackwell Max-Q — 96 GB GDDR7 ECC, 1.79 TB/s, 300 W, dual-slot blower, same silicon as the Workstation Edition.
L40 — 48 GB GDDR6 ECC, 0.86 TB/s, 300 W, datacenter form factor.
L4 — 24 GB GDDR6 ECC, 0.30 TB/s, 72 W, single-slot low-profile, edge inference.

The specs that actually matter

GPU spec sheets are dense and most of the numbers do not change a buying decision. Three of them do.

VRAM capacity. This is binary. Either your model fits, or it does not. CPU offload is not a workable substitute (covered in W01).
VRAM bandwidth. Token generation on a transformer is bandwidth-bound. Spec-sheet TFLOPS is largely irrelevant for inference.
Sustained power and form factor. A 600 W card in a chassis that cannot move the heat is a 300 W card with a thermal alarm. A 72 W card in a 1U server is a different machine than a 575 W card in a 4U workstation.

GPU	VRAM	Bandwidth	TDP	Form factor	ECC	Notes
RTX 4090	24 GB GDDR6X	1.01 TB/s	450 W	3-slot consumer	No	Previous gen, cost-down path
RTX 5090	32 GB GDDR7	1.79 TB/s	575 W	2–3 slot consumer	No	Perf/€ king for inference
RTX Pro 6000 BW Max-Q	96 GB GDDR7	1.79 TB/s	300 W	2-slot blower	Yes	High density, lower power
RTX Pro 6000 BW Server Ed.	96 GB GDDR7	1.79 TB/s	600 W	2-slot passive	Yes	Server-grade, headless
L40	48 GB GDDR6	0.86 TB/s	300 W	2-slot passive	Yes	Datacenter Ada generation
L4	24 GB GDDR6	0.30 TB/s	72 W	1-slot LP	Yes	Edge / 1U inference
H100 SXM (reference, not sold)	80 GB HBM3	3.35 TB/s	700 W	SXM5	Yes	Hyperscaler tier
H200 SXM (reference, not sold)	141 GB HBM3e	4.80 TB/s	700 W	SXM5	Yes	HBM bandwidth king

Inference: tokens per second, by model and card

Single-stream token generation runs at approximately memory bandwidth divided by model size, multiplied by a stack efficiency factor of 0.6–0.8. The table below is what we have measured on bench builds with vLLM 0.6+ and llama.cpp current as of Q2 2026. Figures are tokens per second, INT4 unless noted. Single-stream throughput first; batched aggregate in parentheses where measurable.

Model	Quant	Size	RTX 4090	RTX 5090	Pro 6000 BW Server/WS	Pro 6000 Max-Q	L40	L4
Qwen2.5 7B	INT4	~4 GB	110–130 (220)	180–220 (340)	180–220 (340)	170–200 (320)	90–110 (200)	35–45 (90)
Llama 3.2 13B	INT4	~7 GB	70–85 (170)	120–140 (250)	120–140 (250)	110–130 (230)	60–75 (140)	22–28 (60)
Qwen2.5 32B	INT4	~18 GB	32–38 (90)	55–65 (140)	60–70 (150)	55–65 (140)	28–34 (80)	does not fit
Llama 3.3 70B	INT4	~40 GB	does not fit single	needs 2× (24–30)	28–34 (90) single card	27–32 (85)	needs 2× (16–22)	does not fit
Qwen2.5 72B	INT4	~42 GB	does not fit single	needs 2× (24–30)	28–34 (90) single card	27–32 (85)	needs 2× (16–22)	does not fit
Qwen2.5-VL 72B	INT4	~46 GB+	does not fit single	needs 2× (12–18)	18–24 single card	17–22	needs 2× (10–14)	does not fit
Llama 3.1 405B	INT4	~210 GB	does not fit	needs 8×	4× (single node)	4× (single node)	needs 5×	does not fit

A few honest caveats. These are typical numbers on a properly cooled chassis with the model fully resident. Cold-cache time to first token (TTFT) is dominated by KV-cache allocation and prefill compute, not bandwidth, and lands between 200 and 900 ms across these cards. Batched throughput scales sub-linearly past 8–16 concurrent streams because of compute contention. If your application is interactive (chat, step-by-step agents), single-stream matters more than batched. If your application is bulk (document processing, autolabeling), batched matters more.
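The bandwidth rule of thumb at the top of this section can be sketched in a few lines of Python. This is a minimal sketch, not our bench tooling: the function name `estimate_tok_per_s` and its parameters are hypothetical, and the 0.6–0.8 efficiency band is the one quoted above.

```python
def estimate_tok_per_s(bandwidth_tb_s, model_size_gb, efficiency=(0.6, 0.8)):
    """Estimate single-stream decode speed as a (low, high) range in tok/s.

    Assumes every generated token reads the full set of weights once, so the
    ideal rate is memory bandwidth divided by model size.
    """
    if bandwidth_tb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    # Convert TB/s to GB/s, then divide by the GB of weights read per token.
    ideal = bandwidth_tb_s * 1000.0 / model_size_gb
    return ideal * efficiency[0], ideal * efficiency[1]


# Llama 3.3 70B INT4 (~40 GB) on a 1.79 TB/s card
low, high = estimate_tok_per_s(1.79, 40.0)
print(f"{low:.0f}-{high:.0f} tok/s")
```

For the 70B case this works out to roughly 27–36 tok/s, in line with the 28–34 measured on a single Pro 6000 BW. For the small models in the table the measured figures come in below the estimate, so treat it as a ceiling for sizing, not a benchmark.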

The 4× 5090 building block is the workhorse of our lineup for a reason: it costs €8,500–€14,000 in cards alone, fits four GPUs in a 4U chassis with reasonable airflow, and delivers ~12,000 tok/s aggregate on Llama 3.3 70B INT4 under vLLM with tensor parallelism. A single Pro 6000 Blackwell at €8,500 delivers ~30 tok/s single-stream and ~90 tok/s batched on the same model. For a multi-user serving box, the 5090s win. For a single-user large-context workload with 64 GB+ models, the Pro 6000 wins. There is no universal answer.

Where each card actually makes sense

RTX 5090 — the perf/€ king with sharp edges. Right answer when the workload is inference, the budget is real but not unlimited, and the deployment can tolerate two known limitations: no ECC, and consumer-class power transients that need PSU and chassis care (see W04). For 13B and 32B models, the 5090 is faster per euro than anything else on the table. For 70B-class, four 5090s in tensor-parallel deliver more aggregate throughput than a single Pro 6000 Blackwell at lower total capex. Downside: 575 W nominal with 600+ W transients, 32 GB per-card ceiling that forces multi-GPU for 32B+ at high context. Pick when: 24/7 inference for 7B–32B, perf/€ matters, you have rack airflow, ECC is not a compliance hard requirement. Avoid when: ECC mandatory, single-card 70B+, or the room cannot move 2.4 kW of heat.

RTX 4090 — legacy cost-down only. In 2026, a tactical buy. New retail is rare; used and channel-residual is €1,400–€1,900. Per-card ~55% as fast as a 5090 on memory-bound inference (1.01 vs 1.79 TB/s) and 24 GB vs 32 GB — the 8 GB matters because a 32B INT4 model leaves more KV-cache room on a 5090. Still makes sense for capex-constrained expansion of an existing 4090 fleet. Starting fresh? Buy 5090s.
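The 8 GB gap is easier to see as arithmetic. A minimal sketch, assuming the ~18 GB footprint for a 32B INT4 model from the inference table and ignoring runtime overhead such as the CUDA context and activations, so real headroom is somewhat lower. `headroom_gb` is a hypothetical helper.

```python
def headroom_gb(vram_gb, weights_gb):
    """VRAM left for KV cache once the weights are resident (overheads ignored)."""
    return vram_gb - weights_gb


WEIGHTS_32B_INT4_GB = 18  # approximate, from the inference table

rtx_4090 = headroom_gb(24, WEIGHTS_32B_INT4_GB)
rtx_5090 = headroom_gb(32, WEIGHTS_32B_INT4_GB)
bandwidth_ratio = 1.01 / 1.79  # 4090 bandwidth relative to 5090

print(rtx_4090, rtx_5090, round(bandwidth_ratio, 2))
```

On these assumptions the 5090 has more than twice the KV-cache room of the 4090 for the same model, on top of the bandwidth advantage.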

RTX Pro 6000 Blackwell Server Edition — VRAM king for serious workloads. 96 GB ECC GDDR7 at 1.79 TB/s changes which models you can host. A single card holds Qwen2.5-VL 72B INT4 with comfortable KV cache for ~20 concurrent streams. Four in one node hold Llama 3.1 405B INT4 in a single chassis with no inter-node networking. Passive-cooled, designed for front-to-back rack airflow, no display outputs, validated for 24/7. Same silicon as the Workstation Edition, same 600 W cap, different cooling. Pick when: single-card 70B+ headroom, ECC required, rack deployment with proper airflow, training in the mix, or fewer-bigger-cards beats more-smaller-cards on rack space and power.

RTX Pro 6000 Blackwell Max-Q — high-density without rewiring the room. Same 96 GB and 1.79 TB/s, capped at 300 W. Four Max-Q cards draw 1.2 kW from GPUs; four Server Edition cards draw 2.4 kW. The perf penalty for the power cap is real but smaller than the wattage ratio — Blackwell's perf/W curve is steep at the top end, so capping to 300 W loses 20–30% on inference throughput, not 50%. Pick when: power-constrained environment, you want 96 GB per card, density matters more than peak per-card throughput, or acoustics matter.

L40 — the enterprise inference card with ECC and a track record. Ada generation. Slower than Blackwell on bandwidth (0.86 vs 1.79 TB/s) and capacity (48 vs 96 GB), priced like a datacenter SKU. The reason to buy it is procurement: full ECC, validated drivers, sustained 300 W, two-plus years of production deployment. For environments that forbid consumer cards (insurance, government, some regulated industries), this is the card that ticks the box. For raw perf/€ it loses to the 5090. Pick when: procurement policy forbids consumer hardware, workload fits in 48 GB, 24/7 reliability story matters more than perf/€.

L4 — edge inference, 1U, 72 W. The only card on this list that fits in a 1U server alongside the system board without drama, and the only one that runs on the power budget of a laptop. 72 W TDP, single-slot low-profile, passive, 24 GB GDDR6 ECC, 300 GB/s. The bandwidth is the bottleneck — single-stream 7B lands at 35–45 tok/s, which is "fine" not "fast". The use case is fan-out: 8× L4 in a 2U chassis on one EPYC host delivers 8 concurrent 7B inference streams at modest aggregate cost (~€20k in cards), draws under 700 W, and fits any office circuit. Pick when: edge deployment, 1U/2U, power-constrained, model fits in 24 GB, throughput-per-watt is the metric.

Performance per euro: the table you should not show your CFO

GPU	Price (€)	7B INT4 tok/s (single)	tok/s per €1k	70B INT4 tok/s*	70B tok/s per €1k
RTX 4090 (residual stock)	~€1,700	120	70.6	needs 2× = 28	8.2 (2-card cluster basis)
RTX 5090	~€2,400	200	83.3	needs 2× = 28	5.8 (2-card cluster basis)
RTX Pro 6000 BW Max-Q	~€8,500	185	21.8	30 single card	3.5
RTX Pro 6000 BW Server	~€8,800	200	22.7	31 single card	3.5
L40	~€7,800	100	12.8	needs 2× = 19	1.2 (2-card basis)
L4	~€2,500	40	16.0	does not fit	n/a
H100 SXM (reference)	~€28,000	220	7.9	60 single card	2.1

*For 70B INT4: per-card throughput when the model fits on one card; otherwise single-stream throughput of the smallest tensor-parallel group that holds the model. The per-€1k column divides that throughput by the total cost of the cards in the group.

The 5090 leads perf/€ among current-generation cards at every model size it can hold; only a pair of residual-stock 4090s beats it on paper at 70B (8.2 vs 5.8 tok/s per €1k), with 48 GB of combined VRAM against 64 GB. The Pro 6000 cards win on a different axis: holding a 70B-class model on a single card eliminates the latency and complexity overhead of tensor parallelism. The L40 has the worst perf/€ of any card we sell: it costs roughly 3× a 5090 for ~50% of the inference performance. Its value proposition is procurement compliance and an Ada-generation production track record, not raw economics. The L4 wins only in the small-model, low-power corner, where it has no competition.
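The per-€1k columns are simple to reproduce, which helps when you want to rerun them with your own quotes. A minimal sketch using prices and throughput figures copied from the table; `tok_s_per_keur` and the `CARDS` layout are hypothetical names, not part of any tool we ship.

```python
def tok_s_per_keur(tok_s, price_eur, n_cards=1):
    """Throughput per EUR 1,000 of GPU spend for a group of n identical cards."""
    if price_eur <= 0 or n_cards < 1:
        raise ValueError("price must be positive and n_cards at least 1")
    return tok_s / (price_eur * n_cards / 1000.0)


# name: (price in EUR, 7B tok/s, 70B tok/s, cards needed for 70B), from the table
CARDS = {
    "RTX 4090": (1700, 120, 28, 2),
    "RTX 5090": (2400, 200, 28, 2),
    "Pro 6000 BW Max-Q": (8500, 185, 30, 1),
    "L40": (7800, 100, 19, 2),
}

for name, (price, tok_7b, tok_70b, n_70b) in CARDS.items():
    small = tok_s_per_keur(tok_7b, price)
    large = tok_s_per_keur(tok_70b, price, n_70b)
    print(f"{name}: 7B {small:.1f}, 70B {large:.1f} tok/s per EUR 1k")
```

Swapping in a real quote for `price` is usually the only change needed; the throughput figures should come from your own bench run where possible.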

Performance per watt: the table for the colocation manager

GPU	TDP	7B tok/s	tok/s per W	70B tok/s*	70B tok/s per W
L4	72 W	40	0.56	n/a	n/a
RTX Pro 6000 BW Max-Q	300 W	185	0.62	30	0.10
L40	300 W	100	0.33	19 (×2)	0.03
RTX 5090	575 W	200	0.35	28 (×2)	0.024
RTX 4090	450 W	120	0.27	28 (×2)	0.031
RTX Pro 6000 BW Server	600 W	200	0.33	31	0.052
H100 SXM (reference)	700 W	220	0.31	60	0.086

The Max-Q wins perf/W in this lineup: narrowly over the L4 at 7B (0.62 vs 0.56 tok/s per W), and by roughly 2× over the Server Edition at 70B (0.10 vs 0.052). Capping a 96 GB Blackwell at 300 W keeps the card in the efficient part of its curve, and you get most of the throughput of the Server Edition at half the wall draw. For colocation where power is metered and you pay €0.18–€0.30 per kWh continuously, the Max-Q saves real money over a multi-year deployment versus the Server Edition. We have customers who shifted from Server Edition to Max-Q specifically to avoid upgrading their building's chiller plant.
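To put a number on the colocation saving, the sketch below compares four Server Edition cards with four Max-Q cards at the quoted tariff range. It assumes the cards sit at TDP around the clock and counts GPU power only; host draw, PSU losses, and cooling overhead are left out, so the real difference will vary. `annual_energy_cost_eur` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 h of continuous operation


def annual_energy_cost_eur(watts_per_card, n_cards, eur_per_kwh):
    """Yearly electricity cost for n cards running at TDP around the clock."""
    kwh = watts_per_card / 1000.0 * n_cards * HOURS_PER_YEAR
    return kwh * eur_per_kwh


for rate in (0.18, 0.30):
    server = annual_energy_cost_eur(600, 4, rate)
    max_q = annual_energy_cost_eur(300, 4, rate)
    print(f"EUR {rate:.2f}/kWh: saves about EUR {server - max_q:,.0f} per year")
```

Under these assumptions a four-card node saves roughly €1,900–€3,150 per year in GPU power alone, before any cooling savings are counted.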

Training and fine-tuning notes

Training is not Kentino's primary positioning — most customers buy inference. But fine-tuning shows up everywhere, and the choice for training has different constraints. Full-parameter training of 70B+ models is not viable on this lineup; that requires 8× H100/H200 SXM or rented cloud, and we will say so. LoRA fine-tuning of 7B–32B works comfortably on 4× 5090 or 4× Pro 6000 BW Max-Q. QLoRA of 70B prefers 2× Pro 6000 BW (any edition) over 4× 5090 with FSDP because one card per model replica is dramatically simpler. The decision rule: if training runs are over 24 hours and unattended, ECC matters — pick Pro 6000 or L40. Under 24 hours with a human in the loop, 5090 is fine and faster per euro.

Vision-language and the Pro 6000 vs H100 question

VLMs change the calculus because the activation footprint is bigger and the prefill (image encoding) is more compute-bound. For Qwen2.5-VL 72B INT4 (~46 GB), the Pro 6000 BW delivers 18–24 tok/s on a single card with ~1.4 s prefill; 2× 5090 in tensor-parallel delivers 12–18 tok/s with 20–40 ms per-token TP overhead. For robotics on-prem inference, the Pro 6000 BW is the more honest pick because Qwen2.5-VL 72B is the model people actually want to run, and one card eliminates TP overhead. For autolabeling pipelines and bulk image-to-text where latency does not matter, 4× 5090 still wins on perf/€.

Honest comparison: Pro 6000 BW vs H100

We do not sell H100. We will be specific about the trade-off because customers ask.

Per single card, H100 SXM (80 GB HBM3, 3.35 TB/s) beats Pro 6000 BW Server (96 GB GDDR7 ECC, 1.79 TB/s) on bandwidth-bound single-stream inference by roughly 1.5–1.9× — so 60 tok/s vs 31 tok/s on Llama 3.3 70B INT4. H100 also has NVLink and the SXM5 mezzanine connector, which buys 900 GB/s GPU-to-GPU interconnect in an HGX 8-GPU node. Pro 6000 BW has PCIe 5.0 x16 (~63 GB/s effective), about 14× slower for cross-card traffic.

For inference of models that fit in 96 GB on a single card, this difference is invisible — there is no cross-card traffic. For inference of models that need to be sharded across 4× or 8× cards, H100 with NVLink wins by 30–50% on aggregate throughput because tensor parallelism is interconnect-sensitive. For training across 8 cards, H100 wins decisively.

The price gap is 3–3.5× per card and 8–12× per usable node (HGX H100 includes the carrier board and NVSwitches). For most non-hyperscale workloads, that ratio does not pencil out. For workloads where it does, the customer is not buying from Kentino; they are buying from Dell, Lenovo, or Supermicro direct in 8-figure deals. We will say this on the phone too.

What we will not say: that the Pro 6000 Blackwell is "as good as" or "competitive with" an H100. It is not, on the metrics where H100 was designed to win. It is, however, the right card for the use cases where 96 GB ECC at 1.79 TB/s solves the actual problem the customer has — which is most of them.

Decision flow

Start: What is the workload?

Inference only?
- Single-stream interactive (chat, agent, voice)?
  - Model fits in 32 GB (7B–32B INT4)?
    - Budget is tight: 4× RTX 5090
    - ECC required (compliance): 4× L40
    - Power-constrained office: 4× Pro 6000 BW Max-Q
  - Model needs 32–80 GB (70B INT4, VLM 72B):
    - Want single-card simplicity: 1–2× Pro 6000 BW Server
    - Perf/€ priority, accept 2-way TP: 4× RTX 5090
    - Power-capped: 2× Pro 6000 BW Max-Q
  - Model 80 GB+ (405B INT4, multi-model hosting):
    - 4× or 8× Pro 6000 BW Server in 8-GPU chassis
    - Consider whether cloud is honestly the right call
- Batched bulk (autolabeling, document processing)?
  - Small model (7B–13B): 8× L4 in 2U (edge) or 4× 5090 (rack)
  - Large model (70B+): 4× Pro 6000 BW Server or 8× 5090
- Edge / 1U / power-constrained?
  - 1–8× L4
Training or fine-tuning?
- LoRA / QLoRA / fine-tune (most customers):
  - 7B–13B: 4× RTX 5090 (ECC not critical)
  - 32B–70B: 4× Pro 6000 BW Server (ECC + capacity)
  - Long unattended runs: always pick ECC parts
- Full-parameter 70B+ training: not viable here — recommend cloud or DGX-class
- Diffusion / VLM fine-tune: Pro 6000 BW for batch size, 5090 for perf/€ on smaller batches
Mixed (training + inference, research lab)?
- 4-GPU: 4× Pro 6000 BW Server (Max-Q if power-capped)
- 8-GPU: 8× Pro 6000 BW Server in dual-EPYC chassis
- Mix-and-match: 4× 5090 inference + 1× Pro 6000 BW training in same chassis is workable, not pretty

The branch that resolves to "4× RTX 5090" is the single most common build we ship. The branch that resolves to "4× Pro 6000 BW Server" is the second. The L4 branch and the Max-Q branch are smaller in volume but neither is a niche: every quarter we ship multi-unit deals into office deployments where the building's wiring could not feed 600 W cards.
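For teams that want the flow in a script, for example to pre-screen requests before a call, the single-stream interactive branch can be encoded as a lookup. This is a minimal sketch of that one branch only: the option labels are copied from the flow above, while the dictionary keys, the function name, and the treatment of the 32 GB and 80 GB boundaries are assumptions made for illustration.

```python
# Option labels come from the flow above; the keys are hypothetical.
INTERACTIVE_FLOW = {
    "fits_32gb": {
        "budget": "4x RTX 5090",
        "ecc": "4x L40",
        "power": "4x Pro 6000 BW Max-Q",
    },
    "needs_32_80gb": {
        "single_card": "1-2x Pro 6000 BW Server",
        "perf_per_eur": "4x RTX 5090",
        "power": "2x Pro 6000 BW Max-Q",
    },
}


def recommend_interactive(model_vram_gb, priority):
    """Walk the single-stream interactive inference branch of the flow."""
    if model_vram_gb <= 0:
        raise ValueError("model_vram_gb must be positive")
    if model_vram_gb > 80:
        # The flow offers no per-priority split at this size.
        return "4x or 8x Pro 6000 BW Server (or consider cloud)"
    tier = "fits_32gb" if model_vram_gb <= 32 else "needs_32_80gb"
    options = INTERACTIVE_FLOW[tier]
    if priority not in options:
        raise ValueError(f"priority must be one of {sorted(options)}")
    return options[priority]


print(recommend_interactive(18, "budget"))
print(recommend_interactive(40, "single_card"))
```

A lookup table keeps the recommendations editable as data, which matters because the flow changes whenever prices or the lineup change.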

What we do not stock

Stated plainly: Kentino does not sell H100 SXM, H200 SXM, A100 SXM, B200, or GB200 NVL-class hardware. The SXM5 form factor and the HGX/NVL carrier-board ecosystem live in a tier of supply chain we are not in. PCIe H100 variants existed briefly and are essentially gone from the channel. If your workload genuinely requires 8× H100 with NVLink, your honest options in May 2026 are: rent from a hyperscaler or specialist cloud, buy direct from Dell / Lenovo / Supermicro with a 12–20 week lead, or work with an integrator at that tier.

We do not stock AMD Instinct MI300X or MI325X either — strong on paper for memory-bound inference (192 GB HBM3, 5.3 TB/s on MI300X), but ROCm software maturity and channel availability in Czechia are not where the customer base is for us today.

Where the analysis lands for typical buyers

Research lab, first inference server: 4× RTX 5090 on EPYC Turin with 192 GB RAM, dual ATX PSU, 4U rack chassis. Runs every model up to 70B INT4 with tensor parallelism, with headroom for fine-tuning.
Startup serving production inference: 4× Pro 6000 BW Server in 4U with EPYC Genoa/Turin host, 384–512 GB RAM, CRPS PSU with 1+1 redundancy. ECC, headless, single-card 70B+ headroom.
Robotics lab + on-prem compute: 4× Pro 6000 BW Server. The 96 GB lets you host Qwen2.5-VL 72B and an LLM together, and ECC matters because inference output drives physical hardware.
Enterprise procurement buying for compliance: 4× or 8× L40 in a Supermicro chassis. Worse perf/€, but every BOM line passes audit.
Branch office, retail, edge: 4× or 8× L4 in 1U/2U. Office power, no special HVAC.
Existing 4090 fleet expansion: more 4090s if budget is binding and you can find them; otherwise 5090s mixed in (vLLM handles mixed-generation TP acceptably). Do not mix 4090 with Pro 6000: in tensor parallelism the slowest card sets the pace, so the bandwidth disparity leaves the Pro 6000 waiting on the 4090.

What to do next

Before specifying GPUs, answer these five questions:

List every model you need to host concurrently. Sum INT4 footprints. Add 40–60% for KV cache at target batch and context. That is your minimum VRAM, total and per-card.
State the latency target. A single-stream target under 30 tok/s means you can use almost anything. A target above 60 tok/s narrows you to 5090 or Pro 6000 BW. Bulk throughput-per-day is a different metric and changes the answer.
State the power envelope at the wall. Single-phase 16 A means 4× consumer GPUs maximum. Three-phase or 32 A means 8-GPU is on the table. Office 10 A circuit means L4 or Max-Q only.
State the procurement constraint. "We only buy enterprise SKUs" → L40 or Pro 6000 BW Server. "We buy whatever ships" → 5090. Be honest with yourself; this is the constraint that derails the most builds late in the process.
State the duty cycle and lifetime. 24/7 for three years pays back ECC and Platinum PSUs. Bench machine for development does not.
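The first and third questions reduce to arithmetic, sketched below. The 40–60% KV-cache allowance is the one stated in the list. The 230 V mains voltage and the 800 W host allowance are assumptions chosen for illustration, and all function names are hypothetical; substitute your own site figures.

```python
def min_vram_gb(int4_footprints_gb, kv_overhead=(0.40, 0.60)):
    """Sum the INT4 footprints and add the KV-cache allowance as a (low, high) range."""
    total = sum(int4_footprints_gb)
    return total * (1 + kv_overhead[0]), total * (1 + kv_overhead[1])


def fits_on_circuit(n_cards, tdp_w, amps, host_w=800, volts=230):
    """Check GPU TDP plus a host allowance against nominal circuit capacity.

    volts=230 and host_w=800 are assumed values; PSU losses are ignored.
    """
    return n_cards * tdp_w + host_w <= amps * volts


# Qwen2.5-VL 72B (~46 GB) hosted alongside a 7B model (~4 GB)
low, high = min_vram_gb([46, 4])
print(f"minimum VRAM: {low:.0f}-{high:.0f} GB")

print(fits_on_circuit(4, 575, amps=16))  # 4x 5090 on single-phase 16 A
print(fits_on_circuit(8, 575, amps=16))  # 8x 5090 on the same circuit
print(fits_on_circuit(4, 300, amps=10))  # 4x Max-Q on an office 10 A circuit
```

With these inputs the two-model example needs about 70–80 GB, and the circuit checks agree with the rules of thumb in the list: four consumer cards fit on 16 A, eight do not, and four Max-Q cards fit on a 10 A office circuit.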

If you cannot answer all five, no GPU choice will look right in hindsight. If you can, the right answer falls out of the table above on a single call. See W05 for thermals and airflow, W06 for storage tiers, and W01 for the RAM-to-VRAM sizing rules that underpin GPU selection.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
