Case Study: 4x RTX 4090 AI Workstation

This article documents a complete build commissioned for a research customer who needed a rack-mountable, 24/7-capable LLM inference workstation with enough VRAM to host 70B-class models without cloud dependency. Everything here is measured on the actual hardware. No synthetic estimates, no marketing numbers.

The build was commissioned and delivered in April 2026. Commissioning benchmarks were run on 2026-04-10.

Why 4× RTX 4090

The workload requirement was clear from the start: run a quantized 70B LLM at usable single-request latency, serve concurrent requests in a small research team setting, and keep everything on-prem for data control reasons. The question was how many GPUs of what kind.

A 70B model at INT4 (AWQ or GGUF Q4_K_M) occupies roughly 38–40 GB of VRAM in practice. That eliminates single-GPU solutions entirely — even an RTX 4090 at 24 GB cannot host the model alone. You need at least two, and preferably four, so that tensor-parallel serving under vLLM has headroom for the KV cache.

Four RTX 4090 cards give 96 GB of total VRAM. That is enough to load a Llama 3.3 70B AWQ INT4 model with gpu_memory_utilization=0.80 and still have meaningful KV cache space for batched requests. It also gives compute capacity — 4× 128 SM chips running simultaneously — that matters for prompt processing speed.
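
A rough budget makes the headroom claim concrete. The sketch below is a back-of-envelope estimate, not a measurement: it assumes the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and 40 GB of weights taken from the upper end of the range quoted above. The function names are ours, and the estimate ignores activations, CUDA graphs, and allocator overhead, so real capacity will be lower.

```python
# Rough VRAM budget for a tensor-parallel 70B INT4 deployment.
# Assumed model shape (Llama 3 70B family): 80 layers, 8 KV heads, head_dim 128.

GB = 1e9  # decimal gigabytes, matching the figures in the article


def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache per token: keys plus values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_budget_tokens(total_vram_gb=96, utilization=0.80, weights_gb=40):
    """Tokens of KV cache that fit after weights, ignoring runtime overhead."""
    usable_gb = total_vram_gb * utilization
    free_bytes = (usable_gb - weights_gb) * GB
    return usable_gb, int(free_bytes // kv_bytes_per_token())


usable, tokens = kv_budget_tokens()
print(f"usable VRAM: {usable:.1f} GB")
print(f"KV bytes per token: {kv_bytes_per_token()}")
print(f"upper bound on cached tokens: {tokens}")
```

Under these assumptions the KV cache has room for on the order of 110,000 tokens across all in-flight requests, which is why a small concurrent batch at short context lengths fits comfortably.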

The alternative considered was 4× RTX Pro 6000 Blackwell, which would give 4× 96 GB = 384 GB VRAM. That is a completely different tier: more VRAM than any single 70B model needs, suited to running multiple large models concurrently or hosting 200B+ class models at reasonable quantization. For this workload — one primary model, small concurrent batch, on a single-tenant workstation budget — that extra capacity would go unused and the cost difference is significant. The 4× RTX 4090 was the right answer for the stated use case.

An 8× L40 alternative also exists in the lineup. L40 gives 48 GB VRAM per card, ECC support, and is designed for sustained datacenter load. For this customer, the workload did not require datacenter-grade reliability contracts, freedom from consumer-class driver quirks would have been only a minor benefit, and the budget did not extend that far. It is worth knowing as an upgrade path.

One architectural ceiling to name upfront: RTX 4090 cards have no NVLink. All GPU-to-GPU communication happens over PCIe peer-to-peer. This is meaningful for tensor-parallel inference (discussed in the benchmark section) and worth understanding before ordering. See N03 for a full treatment of when NVLink matters and when it does not.

Hardware Specification

Component | Detail
CPU | AMD EPYC 7542, 32 cores / 64 threads, 2.9 GHz base
Motherboard | ASRockRack ROMED8-2T/BCM, rev 3.01
RAM | 512 GB DDR4 ECC LRDIMM, 8× 64 GB SK Hynix @ 2666 MT/s
GPU | 4× NVIDIA GeForce RTX 4090, 24 GB VRAM each, 96 GB total
NVMe (model storage) | 2 TB PCIe 4.0 NVMe, mounted at /mnt/nvme/models/
OS disk | 512 GB SATA SSD, 100 GB LVM partition allocated, remainder reserved
OS | Ubuntu 24.04.4 LTS, kernel 6.8.0-107-generic
Driver | NVIDIA 590.48.01 (open kernel module)
CUDA | 13.1 (toolkit 13.2), cuDNN 9.20.0
Form factor | 4U rack-mount, front-to-back directed airflow
PSU | Dual ATX, split power delivery (not N+1 redundant)

A few notes on the choices:

CPU. The EPYC 7542 is a 32-core Rome-generation chip with 128 MB of L3 cache across 4 CCDs (8 CCXs of 16 MB each). For an inference workstation this is oversized on raw core count: you will not saturate 64 threads on pure inference. Where it earns its place is the PCIe lane budget. Rome EPYC provides 128 PCIe 4.0 lanes from the CPU, which is why you can put four full x16 GPUs on this platform without lane sharing or bifurcation. The RTX 4090 is a PCIe 4.0 x16 device; you want four full x16 slots, and EPYC gives them. See W02 for lane topology detail.

RAM. The system shipped with 512 GB DDR4 ECC LRDIMM, which exceeded the originally specified 256 GB. Extra system RAM is genuinely useful here: model loading from NVMe is staged through system RAM before transfer to VRAM, and for larger models or when swapping between models, 512 GB means you can hold multiple model weight sets in RAM simultaneously and avoid repeated NVMe reads.

NVMe. Model storage (2 TB) is on a high-speed PCIe 4.0 NVMe drive. Sequential read speed matters at load time: a 38 GB model file at 4,589 MB/s sequential reads loads in roughly 8–10 seconds. Measured load times confirm this: llama.cpp loaded the 70B Q4_K_M GGUF in 10.8 seconds; vLLM (which also builds CUDA graphs on load) took 95 seconds.

Dual PSU. The chassis uses two ATX power supply units. This is split power delivery: each PSU feeds a portion of the system — typically two GPUs and associated board components each. If one PSU fails, you lose the GPUs on its rail. This is not N+1 redundancy; it is a power capacity arrangement, not a failover arrangement. For production systems where uptime is contractual, this distinction matters. See W04 for the full PSU sizing discussion.

Chassis airflow. The chassis is rack-mount with industrial front-to-back directed airflow. The GPUs are not open-air; they sit in a directed airflow channel with case fans pulling air across heatsinks and exhausting out the rear. This makes the system suitable for 24/7 sustained operation in a rack environment. See W05 for thermal design detail.

Software Stack

Package | Version
PyTorch | 2.10.0+cu128
vLLM | 0.19.0
llama-cpp-python | 0.3.20 (CUDA/cuBLAS)
transformers | 4.57.6
HuggingFace Hub, bitsandbytes, accelerate | current at build date

The Python environment lives at /home/logic/llm-env/, models on the NVMe at /mnt/nvme/models/.

Commissioning Results

GPU Compute Baseline

First test is always raw compute: matrix multiplication at FP16 (Tensor Core path) and FP32. This confirms the cards are functioning correctly and gives a compute baseline to compare against nominal specifications.

GPU | FP16 (8192×8192) | FP32 (4096×4096) | Compute capability
GPU 0 | 171.7 TFLOPS | 59.5 TFLOPS | 8.9
GPU 1 | 162.1 TFLOPS | 54.9 TFLOPS | 8.9
GPU 2 | 171.0 TFLOPS | 58.5 TFLOPS | 8.9
GPU 3 | 171.2 TFLOPS | 60.1 TFLOPS | 8.9

For reference: RTX 4090 theoretical FP16 Tensor Core peak is ~330 TFLOPS, FP32 ~82.6 TFLOPS. The benchmark numbers at ~52% of FP16 peak are expected — the measurement is a sustained-throughput matmul that does not achieve the theoretical peak of a hand-tuned GEMM kernel. They confirm all four Tensor Core arrays are working and consistent.

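GPU 1 is ~6% lower on FP16 than the others. Part of this is normal silicon variation within the same bin, and GPU 1 also carried the lowest configured power cap of the four cards (450 W; see the stress test section), which likely contributes. No hardware fault.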
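
The percentages quoted above can be reproduced directly from the table. This is a small sketch using the measured values and the nominal 330 TFLOPS figure; the helper names are ours.

```python
# Compare measured FP16 matmul throughput against the nominal peak.
FP16_PEAK_TFLOPS = 330.0  # nominal RTX 4090 Tensor Core figure quoted above
measured = {0: 171.7, 1: 162.1, 2: 171.0, 3: 171.2}  # commissioning results


def percent_of_peak(tflops, peak=FP16_PEAK_TFLOPS):
    """Measured throughput as a percentage of the nominal peak."""
    return 100.0 * tflops / peak


def deficit_vs_best(values):
    """Percent shortfall of each GPU relative to the fastest card."""
    best = max(values.values())
    return {gpu: 100.0 * (best - v) / best for gpu, v in values.items()}


deficits = deficit_vs_best(measured)
for gpu, tflops in measured.items():
    print(f"GPU {gpu}: {percent_of_peak(tflops):.1f}% of peak, "
          f"{deficits[gpu]:.1f}% below best")
```

A check like this is worth keeping in the commissioning script with a tolerance (say, flag any card more than 10% below the best) so that a genuinely underperforming GPU is caught automatically.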

GPU Memory Bandwidth

Path | Per GPU
VRAM internal (device copy) | ~920 GB/s
Host → Device (PCIe) | 26.2–26.3 GB/s
Device → Host (PCIe) | 1.4 GB/s
GPU ↔ GPU (PCIe peer-to-peer) | 19–22 GB/s

The 920 GB/s VRAM bandwidth is in line with RTX 4090 spec (1,008 GB/s peak; the gap is benchmark overhead). This bandwidth figure is what matters for decode throughput: each generated token requires loading the full set of KV cache and weight tensors from VRAM, so bandwidth directly sets the ceiling on how fast you can generate.
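
The bandwidth ceiling can be written as a one-line bound: tokens per second cannot exceed effective VRAM bandwidth divided by the bytes read per token. The sketch below counts weight reads only (KV cache reads and inter-GPU communication are ignored) and assumes 38 GB of weights; the function name and the parallelism parameter are illustrative.

```python
def decode_ceiling_tok_s(weights_gb, vram_gb_s, gpus_reading_in_parallel=1):
    """Upper bound on single-stream decode rate from weight reads alone.

    Every generated token streams all weights once. If the weights are
    sharded and the shards are read concurrently, the effective bandwidth
    scales with the number of GPUs; communication cost is ignored.
    """
    effective_bw = vram_gb_s * gpus_reading_in_parallel
    return effective_bw / weights_gb


sequential = decode_ceiling_tok_s(38, 920, gpus_reading_in_parallel=1)
parallel = decode_ceiling_tok_s(38, 920, gpus_reading_in_parallel=4)
print(f"one GPU active at a time: {sequential:.1f} tok/s ceiling")
print(f"four GPUs reading in parallel: {parallel:.1f} tok/s ceiling")
```

The measured single-stream rates in the benchmark sections sit below both bounds, as they should; the distance to the parallel bound is where interconnect and kernel overheads show up.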

The GPU-to-GPU figure (19–22 GB/s over PCIe peer-to-peer) is the architectural constraint relevant to tensor-parallel serving. On NVLink-equipped datacenter GPUs (H100 class), the equivalent path runs at up to 900 GB/s; with PCIe only, you get roughly 2% of that. This is not catastrophic for inference, because the tensor-parallel traffic for a 70B AWQ INT4 model across 4 GPUs is small enough for PCIe to carry, but it does suppress single-request decode speed compared to an NVLink-connected system. See the benchmark section below, and N03 for the wider discussion.

One anomaly: Device-to-Host bandwidth measured at 1.4 GB/s, far below the expected ~26 GB/s for PCIe Gen4 x16. This is a known CUDA behavior when the destination host buffer is pageable (non-pinned) memory. If your application moves data from GPU to host frequently (e.g. sampling from output logits in a custom pipeline), copy into pinned buffers: in PyTorch, call Tensor.pin_memory() on a host tensor or preallocate with torch.empty(..., pin_memory=True). Standard vLLM and llama.cpp serving pipelines do not trigger this path in the hot loop.

All PCIe links confirmed Gen4 x16 (16 GT/s) under load. At idle, the driver's power management drops the links to Gen1 (2.5 GT/s); this is normal and not a wiring or riser fault.
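
Because idle links report Gen1, the link check only means something while the GPUs are busy. A minimal sketch of such a check follows, assuming the nvidia-smi query fields pcie.link.gen.current and pcie.link.width.current are available on your driver version; the function names and the sample output are illustrative, not taken from the commissioning log.

```python
import csv
import io
import subprocess

QUERY = "index,pcie.link.gen.current,pcie.link.width.current"


def read_link_state():
    """Run nvidia-smi and return its CSV output (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_link_state(csv_text):
    """Parse 'index, gen, width' rows into tuples of integers."""
    states = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) == 3:  # skip blank or malformed lines
            states.append(tuple(int(field) for field in row))
    return states


def degraded_links(states, want_gen=4, want_width=16):
    """Return GPU indices whose link is below the expected generation or width."""
    return [i for i, gen, width in states if gen < want_gen or width < want_width]


# Illustrative sample only: GPU 3 shown at Gen1, as an idle link would report.
sample = "0, 4, 16\n1, 4, 16\n2, 4, 16\n3, 1, 16\n"
print(degraded_links(parse_link_state(sample)))
```

On the real system, call read_link_state() while a benchmark is running and pass its output to parse_link_state(); a non-empty result under load is the case worth investigating.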

NVMe Storage

Test | Throughput | IOPS
Sequential read (1 MB blocks) | 4,589 MB/s | 4,376
Sequential write (1 MB blocks) | 4,213 MB/s | 4,017
Random read 4K (QD32) | 2,325 MB/s | 568,000
Random write 4K (QD32) | 2,273 MB/s | 555,000

These are strong NVMe numbers for a consumer/prosumer PCIe 4.0 drive. For model serving, the relevant number is sequential read: 4,589 MB/s means a 38 GB model loads to RAM in approximately 8–9 seconds before any VRAM transfer. The 568K random IOPS is more relevant if you are running a retrieval pipeline (RAG, vector store) where the workload is many small random reads — this drive handles that without becoming the bottleneck.
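
Both figures in this paragraph follow from simple arithmetic on the table. The sketch below reproduces them; it treats the load as a best case with no filesystem or page-cache effects, and the helper names are ours.

```python
def load_seconds(file_gb, read_mb_s):
    """Best-case time to stream a file at the given sequential read rate."""
    return file_gb * 1000 / read_mb_s


def throughput_mb_s(iops, block_bytes):
    """Throughput implied by an IOPS figure at a fixed block size (decimal MB)."""
    return iops * block_bytes / 1e6


print(f"38 GB at 4,589 MB/s: {load_seconds(38, 4589):.1f} s")
print(f"568,000 IOPS at 4 KiB: {throughput_mb_s(568_000, 4096):.0f} MB/s")
```

The same two functions are handy when sizing a second drive: plug in the candidate's rated sequential read and the size of the largest model you expect to load.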

vLLM Inference — Llama 3.3 70B AWQ INT4

This is the core benchmark. Model: casperhansen/llama-3.3-70b-instruct-awq. Serving: vLLM 0.19.0 with tensor_parallel_size=4, max_model_len=2048, gpu_memory_utilization=0.80.

Test | Result
Model load time | 95.0 s
Single request, 512 max tokens: throughput | 8.0 tok/s
Single request, 512 max tokens: wall time | 64.3 s
Batch (32 concurrent requests, 256 max tokens): aggregate | 179.3 tok/s
Batch: per-request average latency | 1,428 ms
Short prompt, 16 max tokens: average latency | 2,043 ms

The 8.0 tok/s single-request decode speed requires context. vLLM 0.19.0 ran the AWQ model with the awq kernel, not awq_marlin. The awq_marlin kernel is the faster path for AWQ on Ada Lovelace (RTX 4090) and Blackwell GPUs — the benchmark notes indicate it was not selected during this commissioning run, and the improvement is expected to be 2–3× on single-request decode speed. With awq_marlin, the same model on the same hardware should reach approximately 16–24 tok/s single-stream.
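
One way to avoid relying on auto-detection is to state the quantization method in the launch command. The sketch below only builds the command line; the flag names are assumed to match the vLLM CLI as we understand it and should be checked against the installed version, and the helper function is hypothetical.

```python
import shlex


def vllm_serve_command(model, quantization=None, tp=4, max_len=2048, mem_util=0.80):
    """Build a `vllm serve` command line as a list of arguments."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--max-model-len", str(max_len),
        "--gpu-memory-utilization", str(mem_util),
    ]
    if quantization is not None:
        # Request a quantization method explicitly instead of relying on auto-detection.
        cmd += ["--quantization", quantization]
    return cmd


cmd = vllm_serve_command("casperhansen/llama-3.3-70b-instruct-awq",
                         quantization="awq_marlin")
print(shlex.join(cmd))
```

Whichever way the method is chosen, the startup log remains the place to confirm which kernel was actually loaded.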

The 179.3 tok/s aggregate on 32 concurrent requests is the more production-relevant number. This is what a small team hitting the endpoint simultaneously will see as combined system output. Continuous batching in vLLM means each decode step reads the model weights once for the whole batch, so that cost is amortized across concurrent requests, which is why 32× the users does not produce 32× the latency.

The 2,043 ms figure for a 16-token request is end-to-end latency, not a pure TTFT (time-to-first-token) measurement. At the measured 8.0 tok/s single-stream rate, generating 16 tokens alone takes about 2.0 s, so most of this latency is decode time, and 2,043 ms is best read as an upper bound on TTFT for short prompts. For interactive use cases (chat, code assist), this is on the slower side. The main driver is tensor-parallel communication overhead across four GPUs over PCIe: every forward pass requires AllReduce operations across all four cards via the PCIe fabric. Our estimate is that NVLink would bring TTFT to roughly 50–100 ms; with PCIe P2P at 20 GB/s it stretches further. This is the direct cost of the no-NVLink architecture on latency-sensitive single requests (see N03).
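
The rows of the vLLM table can be cross-checked against each other with a few divisions. This sketch assumes every batched request generated its full 256 tokens, which the commissioning notes do not state explicitly.

```python
def decode_seconds(tokens, tok_per_s):
    """Time spent generating `tokens` at a steady decode rate."""
    return tokens / tok_per_s


# Single request: 512 tokens at 8.0 tok/s.
single = decode_seconds(512, 8.0)

# Batch: 32 requests x 256 tokens at 179.3 tok/s aggregate
# (assumes every request ran to its 256-token limit).
batch_wall = decode_seconds(32 * 256, 179.3)
amortized_ms = 1000 * batch_wall / 32
per_stream_rate = 179.3 / 32

# Short request: 16 tokens at the single-stream rate.
short = decode_seconds(16, 8.0)

print(f"single: {single:.1f} s")
print(f"batch wall time: {batch_wall:.1f} s, {amortized_ms:.0f} ms per request amortized")
print(f"per-stream rate in batch: {per_stream_rate:.1f} tok/s")
print(f"16 tokens of decode: {short:.2f} s")
```

The amortized value lines up with the 1,428 ms row, which suggests that row is batch wall time divided by 32 rather than the latency any one user waits: under this assumption each request in the batch takes roughly the full batch wall time to complete. The short-request row is likewise mostly decode time.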

llama.cpp — Llama 3.3 70B Q4_K_M GGUF

Model: Llama-3.3-70B-Instruct-Q4_K_M.gguf. Backend: llama-cpp-python 0.3.20 with CUDA/cuBLAS, all layers offloaded to GPU.

Test | Result
Model load time | 10.8 s
Single request, 256 max tokens: throughput | 19.9 tok/s
Prompt processing (1,302 tokens) | 1,568 tok/s
Generation, 512 max tokens (110 generated) | 20.3 tok/s

llama.cpp delivers roughly 20 tok/s on single-stream decode, better than the current vLLM AWQ number, which we attribute to llama.cpp's CUDA kernel path working well with the Q4_K_M quantization. As benchmarked here through llama-cpp-python, requests are not batched concurrently the way vLLM batches them, so the 20 tok/s is a ceiling per session, not aggregate capacity. For single-user interactive workflows, 20 tok/s is comfortably above reading speed for English output.

The 1,568 tok/s prompt processing speed is high. This measures how fast the model can ingest the prompt (prefill phase). Fast prefill matters when you are running a model on long system prompts or document context. At 1,568 tok/s, a 4,000-token document context is processed in under 3 seconds before generation begins.
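
Combining the prefill and decode rates gives a quick estimate of how long a single llama.cpp session takes to answer. This is a sketch under the assumption that the prefill rate measured at 1,302 tokens holds at longer prompts, which was not tested; the function name and the 256-token output length are illustrative.

```python
def response_seconds(prompt_tokens, output_tokens,
                     prefill_tok_s=1568, decode_tok_s=20.0):
    """Estimated single-session time: prompt ingestion plus generation."""
    prefill = prompt_tokens / prefill_tok_s
    decode = output_tokens / decode_tok_s
    return prefill, decode, prefill + decode


prefill, decode, total = response_seconds(4000, 256)
print(f"prefill {prefill:.2f} s + decode {decode:.1f} s = {total:.1f} s")
```

For document-heavy prompts with short answers, prefill dominates less than one might expect at these rates; generation is still the larger share in this example.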

For the economics and comparison against cloud alternatives, see T01 (tok/s per euro) and T02 (cost per million tokens on-prem vs cloud).

What the two engines are for

Situation | Right choice
Multiple users hitting an API endpoint simultaneously | vLLM (continuous batching scales)
Single interactive user, latency-sensitive | llama.cpp (lower TTFT, comparable decode at 20 tok/s)
Long document processing, batch job | vLLM (better GPU utilization via batching)
Simple local scripting or dev testing | llama.cpp (10s load time vs 95s, simpler setup)

Stress Test — Sustained Load

Three 60-second stress tests were run to verify thermal and electrical stability under maximum load: GPU-only burn, CPU-only burn, and combined GPU+CPU burn.

GPU burn (FP16 matrix multiply at 100%, 60 s)

GPU | Sustained TFLOPS | Peak temp | Peak power
GPU 0 | 165.8 | 67°C | 482 W
GPU 1 | 153.2 | 64°C | 450 W
GPU 2 | 166.4 | 72°C | 501 W
GPU 3 | 166.2 | 62°C | 481 W
Total | 651.6 | | ~1,914 W combined

Zero computation errors across all four GPUs. Temps rose from ~28°C baseline to stable plateau by second 40. GPU 2 ran warmest at 72°C — it is the hottest card in the chassis airflow path at that position. The thermal throttle threshold on RTX 4090 is 83°C; the highest measured temp (72°C) leaves an 11°C margin. The system was not throttling at any point.

Power caps were set at different values across the four cards (480 W, 450 W, 500 W, 480 W). This is a minor inconsistency that should be normalized before production use: run nvidia-smi -pl 480 -i 0,1,2,3 with root privileges (substituting whichever limit is appropriate) so that all four cards share the same cap.
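
A commissioning script can detect mismatched caps before any benchmark starts. The sketch below parses nvidia-smi output and builds the corrective command without running it; it assumes the power.limit query field is available on your driver, and the helper names are ours.

```python
import subprocess


def read_power_limits():
    """Query configured power limits in watts (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=index,power.limit",
           "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_power_limits(csv_text):
    """Turn 'index, watts' lines into a {gpu_index: watts} mapping."""
    limits = {}
    for line in csv_text.strip().splitlines():
        index, watts = (field.strip() for field in line.split(","))
        limits[int(index)] = float(watts)
    return limits


def normalize_command(limits, target_watts):
    """Return an nvidia-smi command for GPUs whose limit differs from the target."""
    off = sorted(i for i, w in limits.items() if w != target_watts)
    if not off:
        return None
    ids = ",".join(str(i) for i in off)
    return ["sudo", "nvidia-smi", "-pl", str(target_watts), "-i", ids]


# Limits reported in this commissioning run.
sample = "0, 480.00\n1, 450.00\n2, 500.00\n3, 480.00\n"
print(normalize_command(parse_power_limits(sample), 480))
```

Returning the command instead of executing it keeps the privileged step explicit; limits set this way do not necessarily survive a reboot, so the same check is worth running at service start.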

Combined GPU + CPU burn (all 4 GPUs + 64 CPU threads simultaneously, 60 s)

GPU | Sustained TFLOPS | Peak temp | Peak power
GPU 0 | 164.9 | 69°C | 480 W
GPU 1 | 152.5 | 67°C | 450 W
GPU 2 | 165.2 | 73°C | 519 W
GPU 3 | 165.1 | 66°C | 480 W
Total | 647.7 | | ~1,929 W combined

Adding 64 CPU threads at 100% load dropped aggregate GPU TFLOPS by only 0.6%. The EPYC 7542 has a 225 W TDP, so with the GPUs drawing roughly 1.93 kW the combined system was running at approximately 2.1–2.2 kW total. All temperatures stayed within margin: GPU 2 peaked at 73°C under combined load, still 10°C below throttle threshold.

The load average reached 55+ during the CPU burn phase (expected for 64 threads). System was fully stable throughout; no thermal events, no compute errors, no kernel panics.

This confirms the system is suitable for 24/7 sustained inference workloads where the CPU is also busy (preprocessing, tokenization, serving infrastructure overhead).

What Worked Well

The EPYC platform. PCIe lane budget is genuinely solved. All four RTX 4090 cards run at full Gen4 x16 under load (confirmed in the 04d PCIe status check). No bifurcation, no degraded slots. Some AMD Ryzen builds with four GPUs run two or more slots at x8; on EPYC Rome this is not the case.

RAM. 512 GB is comfortable. During llama.cpp model loading, the 38 GB GGUF file is memory-mapped; having 512 GB available means you can run multiple processes alongside the LLM server without competing for memory.

NVMe sequential speed. 4,589 MB/s sequential read keeps load times short. On a model iteration workflow — loading, testing, switching models — this adds up over a day.

Thermals. Worst-case sustained temperature of 73°C (GPU 2, combined burn) with 10°C of headroom. In a typical inference workload, GPUs will not run at 100% utilization continuously — decode is VRAM-bandwidth-bound, not compute-bound — so real operating temperatures will be lower than the stress test peak.

llama.cpp 1,568 tok/s prompt eval. This number surprised us on the positive side. The cuBLAS prefill path on four 4090s is fast. Long-context applications benefit from this.

What Surprised Us

The vLLM AWQ kernel miss. The benchmark ran with the awq kernel instead of awq_marlin. The commissioning suite triggered this automatically based on the model config at the time. With awq_marlin, single-request throughput is expected to jump from 8 tok/s to 16–24 tok/s. This is a software configuration fix, not a hardware limitation — verify your vLLM quantization method string when setting up production serving.

D2H bandwidth. Device-to-host at 1.4 GB/s caught us off guard during initial analysis. It is a CUDA non-pinned memory behavior, not a PCIe fault. For standard serving stacks (vLLM, llama.cpp) it does not matter. For custom inference code that moves tensors to CPU for post-processing, use pinned memory allocations.

GPU 3 PCIe AER corrected errors. During the vLLM benchmark, GPU 3 (bus c1:00.0) logged corrected PCIe errors (RxErr + BadTLP). These were hardware-auto-corrected and did not affect computation. Likely cause: PCIe riser seating or link renegotiation at Gen4 speed. The recommendation is to monitor under sustained production load; if error counts increase, reseat the GPU 3 riser cable. Zero errors during stress tests themselves.
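
Monitoring these counters does not need the full monitoring stack. On recent Linux kernels, per-device AER counters are exposed in sysfs; the sketch below assumes the file aer_dev_correctable exists for the device, that the PCI domain is 0000 (the article gives only bus c1:00.0), and that the errors are counted on the GPU endpoint and not on its upstream port. The snapshots are illustrative, not values from the commissioning log.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'Name count' lines from an aer_dev_correctable file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer_correctable(bdf="0000:c1:00.0"):
    """Read corrected-error counters for one PCIe device from sysfs."""
    path = Path("/sys/bus/pci/devices") / bdf / "aer_dev_correctable"
    return parse_aer_counters(path.read_text())


def new_errors(before, after):
    """Counters that increased between two snapshots."""
    return {k: after[k] - before.get(k, 0)
            for k in after if after[k] > before.get(k, 0)}


# Illustrative snapshots, not values from the commissioning log.
t0 = parse_aer_counters("RxErr 2\nBadTLP 1\nTOTAL_ERR_COR 3\n")
t1 = parse_aer_counters("RxErr 5\nBadTLP 1\nTOTAL_ERR_COR 6\n")
print(new_errors(t0, t1))
```

Taking a snapshot before and after a long serving run and comparing the two gives a simple yes/no answer to whether the count is still growing.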

Brief NIC link flap. The network interface briefly went down and came back during GPU benchmark load (13:38 UTC). Likely a power transient from simultaneous GPU power-up. In production with nvidia-persistenced running (which keeps GPU contexts initialized), power transients at load-start are smaller. Enable nvidia-persistenced as a systemd service before production.

What We Would Do Differently for v2

Enable awq_marlin from the start. Verify the vLLM quantization kernel path during commissioning, not after. Add a kernel identity check to the commissioning script.

Normalize GPU power limits before benchmarking. The four cards shipped with different configured limits (480 W, 450 W, 500 W, 480 W). Setting a consistent limit (nvidia-smi -pl) before the first benchmark run gives cleaner and more comparable numbers, and avoids the inconsistent power draw during combined burn.

Add a monitoring stack at delivery. Prometheus with the DCGM exporter, plus Grafana, takes a few hours to set up and makes GPU temp, VRAM utilization, and PCIe error rates visible in real time. Ship this as part of the standard commissioning rather than leaving it as a post-delivery task. See L05 for the stack setup guide.

Enable nvidia-persistenced as a systemd service during OS setup. It is a one-liner but gets missed consistently. Without it, the first GPU load after a quiet period takes a few extra seconds and produces the power transient that we suspect caused the NIC flap.

LVM expansion. The OS disk (512 GB SATA) has only 100 GB allocated to the LVM partition. The remaining 374 GB is unallocated. There is no reason to leave it: lvextend and resize2fs take 30 seconds and give you that space back for OS overhead, logs, and Docker layers.

Consider a second NVMe for model storage. The single 2 TB NVMe currently holds all models and will fill as the model library grows. A second 4 TB NVMe in RAID 0 or simply as a separate /mnt/nvme2 mount would add flexibility and keep sequential read performance high across a larger total library.

Comparison: Alternatives to This Build

The customer's workload could have been approached with different hardware. Here is an honest comparison:

Configuration | VRAM total | GPU↔GPU interconnect | Notes
4× RTX 4090 (this build) | 96 GB | PCIe P2P, 19–22 GB/s | Good for 70B class. No NVLink = PCIe P2P penalty on TP.
4× RTX Pro 6000 Blackwell | 384 GB | PCIe P2P (no NVLink on Pro 6000) | Same PCIe topology, 4× the VRAM; overkill for a single 70B, right for multi-model or 200B+
8× L40 | 384 GB | PCIe P2P | Datacenter-class ECC, same no-NVLink topology, higher cost
8× RTX 4090 | 192 GB | PCIe P2P | Double the throughput capacity; 8-GPU chassis uses AMD EPYC Genoa/Turin (per K-AI lineup)

The 4× RTX 4090 at 96 GB is the minimum viable configuration for running Llama 3.3 70B AWQ INT4 under vLLM with meaningful batch throughput. It is not the ceiling; an 8-GPU build adds capacity proportionally. For a research customer who wants a single dedicated workstation — not a serving cluster — 4× is often the right place to start.

Neither the 4× nor the 8× configuration has NVLink between GPUs, which means tensor-parallel inference operates over PCIe. The practical consequence is on TTFT latency for single requests, not on aggregate batched throughput. For a team-size workload (tens of requests per hour, not thousands), this is not a limiting factor. For sub-100 ms TTFT requirements, see N03 and the reasoning for why the lineup targets PCIe-connected multi-GPU rather than NVSwitch fabric systems.

Total Power and Electrical Planning

Under sustained GPU load, total system power was approximately 1,900–2,200 W: 1,914 W measured across the four GPUs in the GPU-only burn, and ~2,100 W estimated for the combined burn once CPU, drives, and board are included. Account for PSU efficiency losses (assume 90%) and plan for a 16 A/230 V circuit minimum, with 20 A preferred for headroom.

The dual-PSU layout splits this across two outlets. Both outlets need to be on circuits that can independently handle the load: if PSU A feeds two GPUs at 500 W each plus 200 W of board/CPU, that is 1,200 W on its circuit. Size accordingly.

Room cooling budget: treat this system as 2.5 kW sustained (conservative, includes efficiency losses and a margin). For a server room or rack enclosure with multiple systems, the aggregate number compounds quickly.
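
The planning figures in this section reduce to three conversions: DC load to wall power, wall power to current, and watts to cooling load. The sketch below uses the 90% efficiency and 230 V assumptions stated above, plus the standard 3.412 BTU/h per watt conversion; the helper names are ours.

```python
def wall_watts(dc_watts, psu_efficiency=0.90):
    """AC input power needed to deliver a DC load at the given PSU efficiency."""
    return dc_watts / psu_efficiency


def amps(watts, volts=230):
    """Current drawn at the given supply voltage."""
    return watts / volts


def btu_per_hour(watts):
    """Heat load for cooling sizing (1 W = 3.412 BTU/h)."""
    return watts * 3.412


system_dc = 2200  # upper end of the combined-load estimate, in watts
system_ac = wall_watts(system_dc)
psu_a_ac = wall_watts(2 * 500 + 200)  # two GPUs plus board/CPU share

print(f"system at the wall: {system_ac:.0f} W, {amps(system_ac):.1f} A at 230 V")
print(f"PSU A circuit: {psu_a_ac:.0f} W, {amps(psu_a_ac):.1f} A at 230 V")
print(f"cooling for 2.5 kW: {btu_per_hour(2500):.0f} BTU/h")
```

These are steady-state numbers. GPU power transients and inrush at startup can exceed them briefly, which is part of why the 16 A minimum leaves so much apparent headroom.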

See P01 for single-phase vs three-phase considerations and P04 for breaker sizing.

Pricing Band

A build at this specification — EPYC 7542 platform, 512 GB ECC RAM, 4× RTX 4090, 2 TB NVMe, rack chassis with dual PSU — falls in the €18,000–€24,000 ex VAT range at current component prices, depending on chassis selection, RAM sourcing timing, and GPU availability. Lead time is 10–28 days depending on component availability, confirmed at order.

This range does not include installation, rack PDU, networking (10 GbE switch, cabling), or any software licensing. The system ships with Ubuntu 24.04 LTS pre-installed and the Python venv pre-configured; bring your own model weights.

What This Build Is Right For

  • Research and small-team LLM inference. Running a 70B model for a team of 5–15 users. The 179 tok/s aggregate under vLLM handles concurrent sessions comfortably.
  • Model evaluation and iteration. Fast NVMe load times mean swapping between candidate models is quick. The EPYC platform's PCIe lane budget means all four GPUs are always at full bandwidth.
  • Data-sovereign deployments. All inference is local. No tokens leave the building. This is the primary non-economic reason to run on-prem for research contexts.
  • Budget-conscious 70B entry point. At 24 GB per card, the RTX 4090 is among the highest-VRAM consumer GPUs (the newer RTX 5090 carries 32 GB), and four cards give 96 GB total, which gets you to 70B class without stepping to professional GPU pricing.

What This Build Is Not Right For

  • Sub-100 ms single-request TTFT. The PCIe P2P tensor-parallel topology puts a floor on TTFT for large models. If you need fast interactive latency on 70B+ models, this architecture is the wrong choice. You want NVLink-connected GPUs, which means a different class of hardware entirely (see N03).
  • Running multiple large models simultaneously. At 96 GB total VRAM with gpu_memory_utilization=0.80, you have roughly 77 GB usable. A second 70B INT4 model would not fit. If you need to host two models simultaneously, upgrade to a platform with more VRAM per GPU or more GPUs.
  • Large-scale production serving. For hundreds of concurrent users or SLA-backed uptime, this architecture (no NIC add-ons on a 4-GPU chassis, single-node, consumer GPUs with ECC off) is not the right foundation. The 8-GPU K-AI server with L40 or RTX Pro 6000 Blackwell, proper monitoring, and dual NICs is.
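  • Training large models. At 96 GB total VRAM, you can fine-tune 70B models with parameter-efficient methods such as QLoRA that keep the base model quantized, but FP16 weights alone (about 140 GB) exceed the available VRAM, so you cannot full-parameter-train them. For training, VRAM budget is more constrained than for inference. If training is part of the workload plan, factor this in.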

Lessons Learned & What to Do Next

The four actionable items before putting this system into production:

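  1. Verify the vLLM kernel path. Check the vLLM startup log and confirm that awq_marlin, not awq, is the quantization method in use for AWQ models on Ada Lovelace GPUs; if it is not, request it explicitly (for example with --quantization awq_marlin). Expected result: single-request decode jumps from 8 tok/s to 16–24 tok/s.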
  2. Normalize power limits. Run nvidia-smi -pl 480 -i 0,1,2,3 (adjust to your chosen cap) and confirm all four cards report the same limit before any production benchmarking or workload run.
  3. Enable nvidia-persistenced as a systemd service. Prevents the power transient on first load that caused the NIC flap. One-liner, do it at OS setup time.
  4. Deploy the monitoring stack. Prometheus + DCGM exporter + Grafana. GPU temp, VRAM utilization, PCIe error counters, queue depth. Without it, the first sign of trouble will be a user complaint rather than an alert. See L05.

Related reading: N03 (NVLink vs PCIe P2P — when the gap matters), W02 (PCIe lane topology on EPYC), W04 (PSU sizing and dual-PSU vs N+1), W05 (thermal design for rack-mount GPU systems), T01 (tok/s per euro comparison), T02 (on-prem vs cloud cost per million tokens).

