Thermals and Airflow in Multi-GPU AI Server Builds

A multi-GPU AI server is, thermally, an industrial heater that occasionally does math. A 4× RTX 5090 chassis under sustained load dumps 2.4 kW of heat continuously; an 8× 5090 dumps 5 kW. None of it goes anywhere by itself: it lands on the GPU die, the VRMs, and the memory packages, and from there passes into whatever air the chassis can push out the back. If airflow does not match wattage, the silicon throttles, and throttling on an inference server raises token latency and silently cuts throughput, by 15–25% once a card reaches the hard-throttle band. Most "the GPU server got slower" stories are thermal, not software. This is the airflow side of the build, paired with W04 on power.

Heat is just power, restated

Every watt into a GPU comes out as heat — the card does no mechanical work, so there is no efficiency factor. The TDPs we size against:

GPU	Sustained TDP	Hard cap	Hot-spot ceiling	Throttle target
RTX 5090 (FE / partner board)	575 W	~600 W	~95 °C (silicon)	90 °C edge
RTX 4090	450 W	~500 W	~95 °C	83 °C edge
RTX Pro 6000 Blackwell Workstation	600 W	600 W	~90 °C	88 °C edge
RTX Pro 6000 Blackwell Max-Q	300 W	300 W	~85 °C	85 °C edge
L40	300 W	300 W	~87 °C	87 °C edge
L4	72 W	72 W	~87 °C	87 °C edge
Intel Arc Pro B70 32 GB	200 W	225 W	~90 °C	90 °C edge

Two notes that matter for build decisions. NVIDIA raised the 5090's edge throttle threshold to ~90 °C (up from 83 °C on the 4090) — the chip holds full clocks longer at the same airflow, but the silicon runs hotter, which matters for 24/7 deployments. Workstation and datacenter cards (Pro 6000, L40, L4) hold their rated TDP rigidly — they do not boost above the cap. Consumer cards spike. The workstation lineup is easier to cool predictably; the consumer lineup is easier to oversubscribe accidentally.
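Because every watt becomes heat, the GPU share of a chassis heat load is simply TDP times card count. A minimal sketch using the sustained and hard-cap figures from the table above; the dictionary keys and the function name are illustrative, and CPU, memory, and PSU losses are deliberately left out:

```python
# GPU-only heat load from the TDP table above.
# Values are (sustained TDP, hard cap) in watts. Key names are illustrative.
GPU_TDP_W = {
    "RTX 5090": (575, 600),
    "RTX 4090": (450, 500),
    "RTX Pro 6000 Workstation": (600, 600),
    "RTX Pro 6000 Max-Q": (300, 300),
    "L40": (300, 300),
    "L4": (72, 72),
    "Arc Pro B70": (200, 225),
}


def gpu_heat_watts(model: str, count: int, use_hard_cap: bool = False) -> int:
    """Return GPU heat output in watts; every electrical watt ends up as heat."""
    sustained, hard_cap = GPU_TDP_W[model]
    return count * (hard_cap if use_hard_cap else sustained)


if __name__ == "__main__":
    for n in (4, 8):
        print(f"{n}x RTX 5090:",
              gpu_heat_watts("RTX 5090", n), "W sustained,",
              gpu_heat_watts("RTX 5090", n, use_hard_cap=True), "W at hard cap")
```

Four 5090s account for 2,300 W of GPU heat on their own; the chassis totals quoted in this article also include the platform around them.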

Throttle thresholds and what they cost

Throttle is a gradient, not a switch. On Blackwell-class silicon:

Edge temperature	Behavior
60–75 °C	Full boost, no throttle
75–85 °C	Mild clock variance, near-full boost
85–90 °C	Boost cap reduces, 5–10% lost
90–95 °C	Hard throttle, 15–25% clock loss
> 95 °C	Aggressive throttle, memory throttle, eventual emergency shutdown

A 5090 inference workload at 590 W cold drops to ~510 W when the edge sensor crosses 90 °C — 15% lost tokens-per-second on a vLLM 70B workload, the difference between hitting an SLO and not. A freshly powered-on card hits its first throttle point 60–120 s into sustained load; benchmarks shorter than 5 minutes overstate sustained throughput by 10–20%, which is one of the most common ways published numbers diverge from production reality.
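The gradient in the table can be encoded as a small lookup for dashboards or alert rules. This is a sketch, not vendor firmware logic: the band boundaries come straight from the table, the function name is made up for illustration, and treating each upper bound as exclusive is an assumption, since the table's ranges share endpoints.

```python
# Throttle bands from the table above: (exclusive upper bound in °C, behavior).
THROTTLE_BANDS = [
    (75.0, "full boost, no throttle"),
    (85.0, "mild clock variance, near-full boost"),
    (90.0, "boost cap reduces, 5-10% lost"),
    (95.0, "hard throttle, 15-25% clock loss"),
]
ABOVE_ALL_BANDS = "aggressive throttle, memory throttle, eventual emergency shutdown"


def throttle_band(edge_c: float) -> str:
    """Map a GPU edge temperature to the expected throttle behavior."""
    for upper_bound, behavior in THROTTLE_BANDS:
        if edge_c < upper_bound:
            return behavior
    return ABOVE_ALL_BANDS


if __name__ == "__main__":
    for temp in (70, 84, 86, 92, 97):
        print(temp, "->", throttle_band(temp))
```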

Front-to-back rack airflow — the only sane architecture for 24/7

GPU cooler topologies split into open-air / axial-tower (consumer gaming cards, exhaust into the chassis interior), blower / radial (reference cards, exhaust out the I/O bracket), and passive datacenter cards (L4, L40 — no fan, chassis fans push air through the finstack). For a 4-GPU or 8-GPU build running 24/7, only blower and passive topologies work in a dense chassis. In a 4U with cards stacked vertically, an open-air design exhausts heat into the intake of the card above; the top card sits in 50–60 °C air and throttles within minutes.

Kentino 4U and 8U chassis use industrial front-to-back airflow with 120 mm fans pushing high static pressure across the GPUs. Cards are blower-style, passive, or actively redirected by chassis ducting. The chassis itself is the cooler.

Front of rack, cold aisle (~22 °C intake)
    ↓ 3× 120 mm intake fans (high static pressure)
4U chassis interior, airflow column: GPU 1 · GPU 2 · GPU 3 · GPU 4
    PSU · DIMMs · CPU heatsink + fan · cables routed behind tray
    ↓ hot exhaust: 1× 120 mm rear exhaust + PSU exhaust
Rear of rack, hot aisle (35–45 °C exhaust)

Front-to-back rack airflow: cold aisle intake → GPUs in airflow column → hot aisle exhaust. This is what holds 5090s under 85 °C at 22 °C intake.

Static pressure vs airflow CFM

Fan datasheets list airflow (CFM) and static pressure (mm H2O). In an open case, CFM dominates; in a 4U with dense heatsinks, risers, cable bundles, and passive GPU finstacks in the path, static pressure dominates. A typical 120 mm consumer case fan is rated at 70 CFM and 1.2 mm H2O; a 120 mm industrial server fan (Delta, Nidec, Sanyo Denki San Ace) is rated at 110 CFM and 8–12 mm H2O. The CFM gap is about 60%; the static pressure gap is 7–10×. In a chassis with dense fin pitch, the case fan delivers maybe 20 CFM of actual through-flow; the industrial fan delivers 80–90. This is why the K-AI chassis is loud (55–62 dBA at the rack face) and lives in a rack or closet, not on a desk.

Rules: ~40–50 CFM of through-chassis flow per kW of GPU heat; intake static pressure ≥ 5 mm H2O; CPU cooler must be front-to-back tower style, not top-flow.
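The CFM rule can be cross-checked against the sensible-heat relation for air, ΔT = P / (ṁ · cp). The sketch below assumes dry air at about 1.2 kg/m³ and cp of about 1005 J/(kg·K), perfect mixing, and all heat leaving in the airstream; the function names are illustrative.

```python
# Airflow sizing: rule of thumb from the text plus a bulk heat-balance check.
CFM_TO_M3_PER_S = 0.000471947   # 1 cubic foot per minute in m^3/s
AIR_DENSITY = 1.2               # kg/m^3, assumed (about 20 °C near sea level)
AIR_CP = 1005.0                 # J/(kg K), assumed


def rule_of_thumb_cfm(heat_kw: float, low: float = 40.0, high: float = 50.0):
    """Through-chassis flow range from the 40-50 CFM per kW rule."""
    return heat_kw * low, heat_kw * high


def air_temp_rise_c(heat_w: float, flow_cfm: float) -> float:
    """Bulk intake-to-exhaust temperature rise for a given heat load and flow."""
    mass_flow = flow_cfm * CFM_TO_M3_PER_S * AIR_DENSITY  # kg/s
    return heat_w / (mass_flow * AIR_CP)


def cfm_for_rise(heat_w: float, rise_c: float) -> float:
    """Flow needed to carry heat_w away at a target bulk temperature rise."""
    return heat_w / (rise_c * AIR_DENSITY * AIR_CP * CFM_TO_M3_PER_S)


if __name__ == "__main__":
    print("rule of thumb, 2.4 kW:", rule_of_thumb_cfm(2.4))
    print("rise at 100 CFM: %.1f C" % air_temp_rise_c(2400, 100))
    print("flow for 18 C rise: %.0f CFM" % cfm_for_rise(2400, 18))
```

Under these assumptions, 100 CFM through a 2.4 kW chassis implies a bulk rise of roughly 42 °C, while an intake-to-exhaust rise of about 18 °C corresponds to well over 200 CFM. The per-kW rule is therefore best read as a floor on through-flow rather than a target.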

Pressure, filters, and cable management

Chassis pressure is intake CFM vs exhaust CFM. Positive (more intake) leaks air out through every gap and traps dust at the front filter; negative pulls unfiltered air through every seam. The Kentino 4U is mildly positive by design — three intake, one rear exhaust, plus PSU exhaust. Filters matter: a 50% clogged intake filter drops chassis airflow 30–40%. Inspect every 90 days in an office, every 30 in a lab. Most "the server got hotter after six months" reports are filter problems, not silicon degradation.

Cables in the front-to-back air column are the most-underestimated thermal problem in multi-GPU builds. A 24-pin ATX bundle slung across the intake side of GPU 4 cuts that card's effective airflow by 25–40% and adds 5–8 °C versus its siblings. Route power and EPS behind the motherboard tray, never across the air column; no cable forward of the GPU midpoint. W04 covers why dual-PSU split delivery makes this physically easier on a 4-GPU build — half the cable mass per side. The dual-PSU choice is as much thermal as it is electrical.

Rack U-spacing and hot exhaust

A 4U at 2.4 kW pushes 35–45 °C exhaust at 100+ CFM; an 8U at 5 kW pushes 40–50 °C at 200+ CFM. Blanking panels in unused U slots are mandatory in any enclosed rack — without them, hot exhaust loops back to the cold-aisle intake. Closed cabinets pushed against a wall are the worst case: upper units sit 8–12 °C hotter than lower ones. One empty U above and below each multi-GPU server in non-contained racks buys 5–8 °C of intake headroom. Hot-aisle containment is meaningful at four-rack scale, overkill for a single rack.

Real measurements — 4-GPU and 8-GPU under sustained load

Internal Kentino test runs, vLLM 70B Q4 inference, 30-min steady-state, 22 °C ± 1 °C room.

Build	Intake	GPU edge	CPU edge	Exhaust	Throttle
4× RTX 5090 (4U, EPYC 9354)	23 °C	76–84 °C	68 °C	41 °C	No
8× RTX 5090 (8U, 2× EPYC 9554)	24 °C	78–86 °C	70–72 °C	46 °C	Edge
4× Pro 6000 Workstation (4U)	23 °C	71–77 °C	67 °C	43 °C	No

The 4× 5090 is the design target — 8 °C spread across the bank, boost held within 30 MHz of nominal. The 8× 5090 sits closer to the limit; GPU 8 at 86 °C is at the edge of where boost cap starts. In rooms warmer than 24 °C, an 8× 5090 build starts losing boost on the rearmost cards — the 8-GPU configuration is the one where install-room ambient becomes a first-class build parameter. The 4× Pro 6000 Workstation runs cooler at the same wall draw because the hard 600 W cap and double-flow-through cooler give a more predictable envelope than the 5090's transient-spiking consumer design.

Hotspots beyond the GPU die

The number nvidia-smi reports is the edge sensor — the GDDR memory edge or the silicon edge, depending on the card. It is not the hottest thing in the chassis. Three other locations matter:

VRMs typically run 10–20 °C hotter than the die under sustained load, with a ceiling around 110 °C. On a 5090 at 575 W, board telemetry shows VRM temps in the 85–95 °C range. Cards with weak VRM cooling throttle on VRM temperature before silicon — invisible to nvidia-smi --query-gpu=temperature.gpu, visible only as unexplained clock loss. If a card runs cool on the GPU sensor but loses boost, suspect the VRM.

GDDR7 memory on the 5090 runs hot. Sustained inference with large activation traffic pushes memory junction temps to 95–100 °C. The card throttles memory clock first (3–5% bandwidth loss), then the GPU clock. For memory-bound workloads, memory temperature is the bottleneck, not core temperature.

NVMe SSDs are the silent killer. A PCIe 5.0 drive doing sustained reads (loading 70B weights, dataset streaming) hits 70–80 °C in seconds without active cooling. Above ~75 °C the controller throttles, and read bandwidth halves. A model load that "should take 8 seconds" takes 16, and nobody knows why. Every K-AI build ships NVMe with heatsinks in the chassis airflow path.

To monitor everything that matters in production:

nvidia-smi --query-gpu=index,temperature.gpu,temperature.memory,clocks.gr,clocks.mem,power.draw \
           --format=csv -l 5
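For alerting, the same query is easier to consume with `--format=csv,noheader,nounits`, which drops the header row and unit suffixes. A minimal parsing sketch follows; the thresholds (85 °C edge, 95 °C memory) are taken from figures in this article and chosen here for illustration, the function names are hypothetical, and fields a card does not expose (reported as `[N/A]` or similar) are treated as missing.

```python
import csv
import io
import subprocess
from typing import Dict, List, Optional

FIELDS = ["index", "temperature.gpu", "temperature.memory",
          "clocks.gr", "clocks.mem", "power.draw"]


def _to_float(cell: str) -> Optional[float]:
    """Convert a CSV cell to float; unsupported fields such as '[N/A]' become None."""
    try:
        return float(cell.strip())
    except ValueError:
        return None


def parse_samples(csv_text: str) -> List[Dict[str, Optional[float]]]:
    """Parse nvidia-smi output produced with --format=csv,noheader,nounits."""
    samples = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        samples.append({name: _to_float(cell) for name, cell in zip(FIELDS, record)})
    return samples


def alerts(sample: Dict[str, Optional[float]],
           edge_limit_c: float = 85.0, mem_limit_c: float = 95.0) -> List[str]:
    """Return human-readable alerts for one GPU sample (illustrative thresholds)."""
    found = []
    gpu = int(sample["index"]) if sample["index"] is not None else -1
    edge, mem = sample["temperature.gpu"], sample["temperature.memory"]
    if edge is not None and edge >= edge_limit_c:
        found.append(f"GPU {gpu}: edge {edge:.0f} C at or above {edge_limit_c:.0f} C")
    if mem is not None and mem >= mem_limit_c:
        found.append(f"GPU {gpu}: memory {mem:.0f} C at or above {mem_limit_c:.0f} C")
    return found


def read_once() -> str:
    """Take one reading from nvidia-smi (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `alerts()` on each row of `parse_samples(read_once())` every 5 s matches the polling interval of the command above.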

For NVMe, nvme smart-log /dev/nvme0 reports controller and composite temperatures; alarm at 70 °C composite. VRM temperature is exposed on Pro 6000 cards via DCGM (dcgm-exporter for Prometheus); on consumer cards it is board-vendor-specific and often only surfaced in Windows utilities — one of several reasons we prefer workstation cards in long-running production.
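The 70 °C composite alarm can be scripted against the JSON form of the same command (`nvme smart-log /dev/nvme0 --output-format=json`). One assumption to verify on your nvme-cli version: this sketch treats the JSON `temperature` field as kelvin, unlike the human-readable output, which prints Celsius. Function names are illustrative.

```python
import json
import subprocess

ALARM_C = 70  # composite-temperature alarm threshold from the text above


def composite_temp_c(smart_log_json: str) -> int:
    """Extract the composite temperature in °C from smart-log JSON.

    Assumption: the 'temperature' field is reported in kelvin.
    """
    kelvin = json.loads(smart_log_json)["temperature"]
    return int(kelvin) - 273


def nvme_over_limit(smart_log_json: str, limit_c: int = ALARM_C) -> bool:
    """True when the composite temperature is at or above the alarm threshold."""
    return composite_temp_c(smart_log_json) >= limit_c


def read_smart_log(device: str = "/dev/nvme0") -> str:
    """Run nvme-cli once (needs the nvme-cli package and usually root)."""
    cmd = ["nvme", "smart-log", device, "--output-format=json"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```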

Ambient room temperature and the ASHRAE envelope

ASHRAE TC9.9 defines the thermal envelopes datacenter design follows. Class A1 (tier-1 colocation) recommends 18–27 °C inlet; Class A2 (general enterprise) extends allowable to 10–35 °C. The K-AI lineup is designed to A2, but the no-throttle envelope for a 4× or 8× 5090 chassis sits inside A1: 22 °C intake is the design point, and boost loss begins at 23–24 °C, per the no-throttle ceilings in the table below. Humidity matters too: ASHRAE recommends 20–80% non-condensing. Aim for 40–60% RH year-round.

Build	Recommended ambient	Ceiling (no throttle)	Hard ceiling (any throttle)
4× 4090	18–24 °C	26 °C	30 °C
4× 5090	18–22 °C	24 °C	28 °C
4× Pro 6000	18–25 °C	27 °C	32 °C
8× 5090	18–22 °C	23 °C	26 °C
8× Pro 6000	18–24 °C	25 °C	29 °C
8× L40	18–26 °C	28 °C	32 °C
8× L4	18–28 °C	30 °C	35 °C

The L40 and L4 numbers are why those cards remain interesting for office deployments: they tolerate normal office HVAC. An 8-GPU 5090 build needs a server room or closet with dedicated cooling, period.
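The ambient table lends itself to a lookup that a monitoring script can apply to a measured cold-aisle temperature. This is a sketch: the keys use a plain "x" instead of "×", the function name is illustrative, and the reading of the last two columns (no throttle up to the first ceiling, some throttle up to the hard ceiling, out of envelope beyond it) is our interpretation of the table headers.

```python
# Ambient limits from the table above, in °C:
# (recommended low, recommended high, no-throttle ceiling, hard ceiling)
AMBIENT_LIMITS_C = {
    "4x 4090": (18, 24, 26, 30),
    "4x 5090": (18, 22, 24, 28),
    "4x Pro 6000": (18, 25, 27, 32),
    "8x 5090": (18, 22, 23, 26),
    "8x Pro 6000": (18, 24, 25, 29),
    "8x L40": (18, 26, 28, 32),
    "8x L4": (18, 28, 30, 35),
}


def classify_ambient(build: str, ambient_c: float) -> str:
    """Place a measured intake temperature in the build's ambient envelope."""
    low, high, no_throttle, hard = AMBIENT_LIMITS_C[build]
    if ambient_c < low:
        return "below recommended range"
    if ambient_c <= high:
        return "recommended"
    if ambient_c <= no_throttle:
        return "above recommended, no throttle expected"
    if ambient_c <= hard:
        return "some throttle expected"
    return "beyond hard ceiling"


if __name__ == "__main__":
    for build in ("8x 5090", "8x L4"):
        print(build, "at 25 C:", classify_ambient(build, 25))
```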

HVAC sizing in one paragraph

Room cooling load equals sustained wall draw: 1 kW = 3,412 BTU/hr. A 2.4 kW 4-GPU server is ~8,200 BTU/hr; a 4.5 kW 8-GPU server is ~15,400 BTU/hr. Size AC at 1.3× the steady-state load — same headroom rule as PSUs. A 12,000 BTU split on a 2.4 kW server runs 100% duty cycle and kills the compressor in 18–30 months; a 24,000 BTU unit on the same load runs at 50% duty and lasts 8–10 years. Precision (CRAC) cooling becomes relevant above 10 kW; below that a properly-sized split does the job.
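The sizing arithmetic as a short sketch, using the 3,412 BTU/hr per kW conversion and the 1.3× headroom factor from the paragraph above. Function names are illustrative, and the result should be compared against the capacity a unit actually delivers at your room conditions, which may differ from its nameplate figure (an assumption worth checking against the manufacturer's data).

```python
BTU_PER_HR_PER_KW = 3412.0  # 1 kW of heat = 3,412 BTU/hr


def cooling_load_btu_hr(wall_kw: float) -> float:
    """Room cooling load for a given sustained wall draw."""
    return wall_kw * BTU_PER_HR_PER_KW


def recommended_ac_btu_hr(wall_kw: float, headroom: float = 1.3) -> float:
    """Minimum AC capacity with the 1.3x headroom rule applied."""
    return cooling_load_btu_hr(wall_kw) * headroom


if __name__ == "__main__":
    for kw in (2.4, 4.5):
        print(f"{kw} kW: load {cooling_load_btu_hr(kw):,.0f} BTU/hr, "
              f"size AC at {recommended_ac_btu_hr(kw):,.0f} BTU/hr or more")
```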

Form factor: 4U rack, 8U rack, tower

The K-AI lineup uses three: 4U rack for 4-GPU builds (3× 120 mm intake, 1× rear, dual ATX, 19-inch rack), 8U rack for 8-GPU builds (industrial server fans, CRPS power, dual-CPU motherboard, roughly double the 4U heat density), and tower workstation for 1- and 2-GPU dev boxes (PWM fans, office-friendly). Above 2 GPU we do not ship towers — a 4-GPU vertical chassis hits 90 °C edge on the top card within 20 minutes of sustained load. The same hardware in a 4U rack stays under 85 °C indefinitely.

Liquid cooling — when and why

Air handles ~600 W per GPU in a well-designed 4U; above that, liquid is the answer. AIO per-card drops GPU edge 15–25 °C but adds an order of magnitude of complexity, with pump failure and silent coolant evaporation as the new failure modes. Direct-to-chip with a rear-of-rack heat exchanger plumbed to facility chilled water is the right answer at 16+ GPU per cluster. Immersion in dielectric fluid is efficient, expensive, and changes the serviceability model entirely.

For the current Kentino lineup (air-cooled chassis up to 600 W per card), air is the right answer. A 4× 5090 build runs 76–84 °C edge with zero throttle, 24/7, on a 22 °C cold aisle. Liquid would bring that to 55–65 °C and gain a few percent of boost clock; the capex and complexity delta does not justify it at this scale.

What to do next — thermal monitoring checklist

If you are sizing the thermal side of a build or deployment room:

Cold-aisle ambient in the install room? Measure under realistic load, not on a Sunday with the AC running hard. Compare against the ambient table above.
Room cooling sized at 1.3× server wall draw? An AC sized to exactly match the load runs 100% duty cycle and fails inside two years.
Where does the hot exhaust go? Open rack with a hot aisle is fine; enclosed cabinet without containment, or closet with the server pointed at a wall, is not.
Duty cycle? A dev box at 30% load has different cooling needs than a 24/7 inference server.
Filter and growth plan? A clogged filter quietly halves airflow; a second server doubles heat load. Schedule both.
Telemetry running? nvidia-smi polled at 5 s for GPU edge / memory / clocks / power, nvme smart-log for drives, DCGM for VRM where available, room ambient + humidity in the monitoring stack with alarms at 27 °C and outside 40–60% RH.
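The room-side alarms in the last checklist item reduce to two comparisons. A minimal sketch with the 27 °C and 40–60% RH limits from the checklist as defaults; the function name is illustrative, and how the readings reach it (sensor, BMS, exporter) is left open.

```python
from typing import List, Tuple


def room_alarms(temp_c: float, rh_pct: float,
                temp_limit_c: float = 27.0,
                rh_range: Tuple[float, float] = (40.0, 60.0)) -> List[str]:
    """Return alarm messages for room ambient temperature and relative humidity."""
    alarms = []
    if temp_c >= temp_limit_c:
        alarms.append(f"ambient {temp_c:.1f} C at or above {temp_limit_c:.1f} C")
    rh_low, rh_high = rh_range
    if not rh_low <= rh_pct <= rh_high:
        alarms.append(f"humidity {rh_pct:.0f}% RH outside {rh_low:.0f}-{rh_high:.0f}%")
    return alarms


if __name__ == "__main__":
    print(room_alarms(22.5, 48))   # healthy room, no alarms
    print(room_alarms(27.4, 35))   # warm and dry, two alarms
```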

Chassis-level design — front-to-back airflow, industrial 120 mm fans, blower or passive GPUs, disciplined cable routing — ships by default in every K-AI build. The room and the rack are the customer's side of the line, and they are where most field issues originate.

W06 (next in the W-series) covers storage tiers — the NVMe, SAS, and bulk pool layouts that pair with these compute chassis.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
