RAM and VRAM: How They Relate in an AI Server
The first question buyers ask about an AI server is "how many GPUs". The second is "what CPU". The question that actually decides whether the box works well — and the one most spec sheets bury — is how the two memory systems are sized relative to each other. A 4-GPU machine with 192 GB of VRAM and 32 GB of system RAM is broken. The same machine with 1 TB of system RAM is, for most workloads, money set on fire. The right answer sits in between and depends on what you actually run.
This article walks through what VRAM and system RAM each do, how they relate, where the bandwidth bottlenecks sit, and what ratios hold up in practice. The audience is buyers and integrators sizing a build, not engineers writing CUDA kernels.
What VRAM actually holds
When a model is "loaded on the GPU", three things live in VRAM:
- Model weights. A 70B model at FP16 is 140 GB; at INT8, 70 GB; at INT4 (the common self-host quantization), 35–40 GB depending on quant scheme.
- KV cache. Per-request memory cost of attention. A 70B model at 8K context single-stream is 1–2 GB. At 32K it is 4–8 GB. Under batched serving (10–20 concurrent requests), the KV cache, not the weights, is what fills your remaining VRAM.
- Activations and workspace. Forward-pass intermediates, attention scratch, kernel workspace. A few GB for inference; substantially more during training because activations are stored for the backward pass.
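The per-request KV figures depend on the model's attention layout and on the cache precision. As a rough sketch, assuming a Llama-style 70B layout (80 layers, 8 KV heads, head dimension 128; these values are an assumption, not taken from this article), the cache holds one key and one value vector per layer per token. With an 8-bit cache this layout lands inside the ranges quoted above; a 16-bit cache doubles it.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """KV-cache size for one request: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token

# Assumed Llama-style 70B layout (illustrative, not from the article).
LAYOUT = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for ctx in (8192, 32768):
    for label, nbytes in (("8-bit", 1), ("16-bit", 2)):
        gb = kv_cache_bytes(ctx, bytes_per_value=nbytes, **LAYOUT) / 1e9
        print(f"{ctx:>6} tokens, {label} cache: {gb:.2f} GB")
```

Multiply the per-request figure by the number of concurrent requests to see how quickly batched serving eats the VRAM left over after the weights.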
For training, add optimizer state (Adam keeps two FP32 values per weight, 8 bytes in total, or roughly 4× the FP16 weight size) and gradients (1× weight size). This is why training a 70B model from scratch needs 8× H100 or A100 80 GB nodes and is not something a Kentino box does. Fine-tuning with LoRA or QLoRA is a different story and lives comfortably on a 4-GPU 5090 or Pro 6000 Blackwell build.
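The training overhead can be tallied per parameter. This is a minimal sketch that counts only the three items named above: FP16 weights, FP16 gradients, and Adam's two FP32 values. It deliberately leaves out activations and any FP32 master copy of the weights, so real jobs need more. The function name is illustrative.

```python
def training_state_gb(params_billion):
    """Memory in GB for weights, gradients and Adam state (activations excluded)."""
    bytes_per_param = {
        "weights_fp16": 2,
        "gradients_fp16": 2,     # 1x the FP16 weight size
        "adam_moments_fp32": 8,  # two FP32 values per weight
    }
    # billions of parameters x bytes per parameter = gigabytes
    return {name: params_billion * b for name, b in bytes_per_param.items()}

state = training_state_gb(70)
total = sum(state.values())
for name, gb in state.items():
    print(f"{name:>18}: {gb} GB")
print(f"{'total':>18}: {total} GB")
```

For a 70B model this already exceeds the 640 GB of a single 8× 80 GB node before activations are counted.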
Practical implication: model size in VRAM is not "parameters × bytes per parameter". For an 8K-context deployment of a 70B model at INT4, plan 40 GB weights + 20–40 GB KV cache at realistic batch + 4 GB overhead = 64–84 GB, with ~70 GB as a working figure. That fits on a single RTX Pro 6000 Blackwell Server Edition (96 GB) or needs 3–4 RTX 5090 for any reasonable batch. "VRAM total" matters less than "VRAM per card and how they connect".
VRAM bandwidth: the number that decides token-gen speed
Token generation on a transformer LLM is bandwidth-bound, not compute-bound. Each generated token reads the entire model from VRAM through the memory bus. The spec-sheet TFLOPS number is largely irrelevant for inference; what matters is GB/s of memory bandwidth.
| GPU | VRAM | Memory type | Bandwidth | Source |
|---|---|---|---|---|
| RTX 4090 | 24 GB | GDDR6X | 1.01 TB/s | NVIDIA spec |
| RTX 5090 | 32 GB | GDDR7 | 1.79 TB/s | NVIDIA spec |
| RTX Pro 6000 Blackwell (workstation) | 96 GB | GDDR7 ECC | 1.79 TB/s | NVIDIA spec |
| RTX Pro 6000 Blackwell Server Ed. | 96 GB | GDDR7 ECC | 1.79 TB/s | NVIDIA spec |
| L40 | 48 GB | GDDR6 ECC | 0.86 TB/s | NVIDIA spec |
| L4 | 24 GB | GDDR6 | 0.30 TB/s | NVIDIA spec |
| H100 SXM (reference, not sold) | 80 GB | HBM3 | 3.35 TB/s | NVIDIA spec |
| H200 SXM (reference, not sold) | 141 GB | HBM3e | 4.8 TB/s | NVIDIA spec |
Kentino does not sell H100 or H200; they are listed for honest comparison. They remain the bandwidth kings and the reason hyperscalers buy them. The price gap is 6–10×; the bandwidth gap against a 5090, which is what sets single-stream inference speed, is roughly 2× for the H100 and 2.7× for the H200. For non-hyperscale workloads, that math does not favour HBM.
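A rough rule for INT4 single-stream token generation: tok/s ≈ bandwidth (GB/s) / model size (GB), times a stack efficiency factor of 0.6–0.8. A 70B model at INT4 (~40 GB) at RTX 5090 bandwidth (the weights exceed one 32 GB card, so in practice they are split across cards; with a layer-wise split each token still reads the full 40 GB at per-card bandwidth):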
1790 GB/s × 0.7 / 40 GB ≈ 31 tok/s (single stream, no batching)
This matches what we measure on the bench. Batching pushes aggregate throughput to 50–100 tok/s, but per-stream speed stays close to the bandwidth ceiling. No amount of system RAM changes this number.
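The rule of thumb above can be written as a small function. The bandwidth figures come from the table, the 0.7 default is the midpoint of the 0.6–0.8 efficiency range, and the function name is illustrative. It is a ceiling estimate only: it does not check whether a 40 GB model fits on a given card.

```python
def tokens_per_second(bandwidth_gb_s, model_gb, efficiency=0.7):
    """Single-stream estimate: every token reads the whole model once."""
    return bandwidth_gb_s * efficiency / model_gb

MODEL_GB = 40  # 70B at INT4, as in the text

# Bandwidths in GB/s, taken from the table above.
CARDS = [("L40", 860), ("RTX 4090", 1010), ("RTX 5090", 1790), ("H100 SXM", 3350)]

for name, bandwidth in CARDS:
    rate = tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name:>9}: {rate:5.1f} tok/s")
```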
ECC VRAM: real for training, less critical for inference
The RTX Pro 6000 Blackwell line carries ECC (error-correcting) VRAM. Consumer cards (5090, 4090) do not. Marketing makes this sound critical; the reality is more nuanced.
ECC VRAM detects and corrects single-bit memory errors in flight. Without it, a flip propagates — usually invisibly during inference (one token slightly different than it would have been), occasionally catastrophically during training (NaN propagation, divergence, dead run).
When ECC matters:
- Long-running training. Memory traffic over multi-day jobs makes a silent bit flip a real probability. An undetected error can cost you a 48-hour run; a corrected one costs nothing.
- Numerical workloads with no human in the loop. Simulation, modelling, anything consumed downstream without sanity-checking.
- Regulated workloads. If your compliance regime requires bit-exact reproducibility, ECC is mandatory.
When ECC is largely cosmetic:
- LLM inference serving. Bit-flip rate on modern GDDR7 is low enough that output-quality impact sits below noise. We have run consumer 5090 cards under heavy inference for months without seeing anomalies traceable to VRAM errors.
- Image and video generation. Perceptual noise floor swallows a single-bit error.
- Development and experimentation. Restart and rerun is cheap.
Honest version: if the workload is primarily inference, the Pro 6000 premium pays for 96 GB of VRAM and validated drivers, not the ECC. If the workload is training, ECC earns its keep. We sell both and will tell you the same thing on the phone.
System RAM: how much, and the truth about CPU offload
System RAM does four things in an AI server:
- Stages model load from disk to VRAM. A 70B model file moves NVMe → page cache → system RAM → VRAM. If system RAM is smaller than the file, loading either fails or thrashes.
- Backs the OS, the inference server (vLLM, llama.cpp, Triton), and auxiliary services (vector DB, monitoring, request queue).
- Holds tokenizer state, request queues, and pre/post-processing buffers.
- Optionally hosts CPU-offloaded layers. This is the one people overestimate.
CPU offload, in llama.cpp and similar runtimes, lets you run a model larger than VRAM by keeping some layers in system RAM and running them on the CPU (or, in some runtimes, streaming them to the GPU) for every token. It works. In nearly every realistic case, it is also an exercise in misery.
Numbers: a 5090 has 1.79 TB/s of VRAM bandwidth. A 12-channel EPYC Genoa platform with DDR5-4800 delivers ~460 GB/s aggregate. CPU offload is 4–6× slower per token than full VRAM residency, optimistically — that assumes perfect NUMA locality and a CPU that is not also busy with serving overhead.
Benchmarks from a 4×5090 box with --n-gpu-layers tuned:
- Fully on GPU (70B INT4 across 4×32 GB): 28–32 tok/s single-stream.
- 80% on GPU, 20% on CPU: 6–9 tok/s.
- 50/50: 2–4 tok/s.
This is not a Kentino opinion. It is how DDR5 bandwidth relates to GDDR7 bandwidth. The fix for "model does not fit in VRAM" is more or better GPUs, not system RAM with offload. The exception is the AMD Ryzen AI Max 300 unified-memory platform, a different beast and out of scope here.
Buy enough system RAM to load and serve, not to compute.
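The bandwidth argument can be sketched as a two-part read per token: one share of the weights at VRAM speed, the rest at system RAM speed. This is an optimistic ceiling, assuming both sides are purely bandwidth-bound at the peak figures quoted above and that the same 0.7 efficiency factor applies to the CPU side (an assumption). Measured results, like the benchmarks above, land well below it.

```python
def offload_tokens_per_second(model_gb, gpu_fraction, gpu_bw_gb_s, cpu_bw_gb_s,
                              efficiency=0.7):
    """Upper-bound tok/s when part of the weights is read from system RAM per token."""
    gpu_time = model_gb * gpu_fraction / gpu_bw_gb_s        # seconds per token, VRAM share
    cpu_time = model_gb * (1 - gpu_fraction) / cpu_bw_gb_s  # seconds per token, RAM share
    return efficiency / (gpu_time + cpu_time)

# 70B INT4 (~40 GB), 5090 VRAM bandwidth, 12-channel DDR5-4800 peak bandwidth.
for fraction in (1.0, 0.8, 0.5):
    rate = offload_tokens_per_second(40, fraction, 1790, 460)
    print(f"{fraction:.0%} on GPU: at most {rate:.1f} tok/s")
```

Even in this best case, moving 20% of the weights to system RAM costs about a third of the throughput, because the slow share takes as long to read as the fast share.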
How much system RAM, concretely
A working rule for K-AI builds:
System RAM ≈ 1.5 × total VRAM, rounded to the next standard config.
For a 4-GPU build:
| Build | Total VRAM | Recommended system RAM |
|---|---|---|
| 4× RTX 4090 (96 GB total) | 96 GB | 128 GB |
| 4× RTX 5090 (128 GB total) | 128 GB | 192 GB |
| 4× RTX Pro 6000 BW (384 GB total) | 384 GB | 512 GB |
| 4× L40 (192 GB total) | 192 GB | 256 GB |
For 8-GPU builds, RAM scaling is not strictly linear — stay within one socket's channels where you can. We default to 256 GB on 8× 5090 and 512 GB on 8× Pro 6000 Blackwell.
The rule has two failure modes at the edges:
- Under-spec'd: 64 GB on an 8-GPU box. Model loads slowly, the page cache cannot hold weights for fast reload, and concurrent serving plus auxiliary services (pgvector, monitoring) starts to swap.
- Over-spec'd: 2 TB on a 4-GPU inference box. It works fine, but you have spent €4,000–€8,000 on RAM that pages air. The exception is hosting many models and rotating them VRAM↔RAM — then large system RAM acts as a hot cache. Rare outside research labs.
The "64 GB is enough" case exists: a 2-GPU machine, one model at a time, no concurrency, no auxiliary services. Not a serious server but a serious developer workstation.
EPYC channels: where bandwidth actually comes from
System RAM bandwidth on AMD EPYC (the basis for nearly all our 8-GPU servers) scales with the number of populated memory channels, not the headline DIMM speed. Channels are per socket, populated one DIMM per channel.
| Platform | Channels per socket | DIMM speed (typical) | Per-socket bandwidth |
|---|---|---|---|
| EPYC 9004 (Genoa) | 12 | DDR5-4800 | ~460 GB/s |
| EPYC 9005 (Turin) | 12 | DDR5-6000 | ~576 GB/s |
| EPYC 9005 Turin Dense | 12 | DDR5-6400 | ~614 GB/s |
| Xeon SP 5th gen | 8 | DDR5-5600 | ~358 GB/s |
Two things from this table:
- Populate all twelve channels on an EPYC Genoa/Turin platform to get the advertised bandwidth. Eight DIMMs in a twelve-channel system gives eight channels of bandwidth, not twelve. We see this misconfigured constantly in the field.
- DIMM count drives the minimum sensible RAM size. 12 × 16 GB = 192 GB. 12 × 32 GB = 384 GB. The "save money" configurations that under-populate (six 32 GB DIMMs for 192 GB) leave half the bandwidth on the table. Do not do this.
Dual-socket gives 24 channels total; bandwidth doubles in aggregate, but only if the workload respects NUMA.
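The per-socket figures in the table follow from simple arithmetic: populated channels × transfers per second × 8 bytes per transfer on a 64-bit channel. The sketch below reproduces them and shows what under-population costs. These are theoretical peaks; sustained bandwidth is lower. The function name is illustrative.

```python
def ddr5_bandwidth_gb_s(populated_channels, mt_per_s, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for the populated channels of one socket."""
    return populated_channels * mt_per_s * bytes_per_transfer / 1000

# Fully populated platforms from the table above.
print(ddr5_bandwidth_gb_s(12, 4800))  # EPYC 9004, DDR5-4800
print(ddr5_bandwidth_gb_s(12, 6000))  # EPYC 9005, DDR5-6000
print(ddr5_bandwidth_gb_s(8, 5600))   # Xeon SP 5th gen, DDR5-5600

# The same Genoa socket with only six or eight DIMMs installed.
for dimms in (6, 8, 12):
    print(f"{dimms:>2} DIMMs: {ddr5_bandwidth_gb_s(dimms, 4800):.1f} GB/s")
```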
NUMA: the cost of crossing the line
A dual-socket EPYC server has two CPU packages, each with its own memory controllers, DIMM slots, and PCIe root complex. Crossing from one socket's memory to the other socket's GPU traverses Infinity Fabric: fast, but not as fast as staying local.
Rough but useful numbers:
| Path | Bandwidth | Latency penalty vs local |
|---|---|---|
| CPU socket 0 → local DIMM | ~576 GB/s | 1× (baseline) |
| CPU socket 0 → remote DIMM (via fabric) | ~256–320 GB/s | 1.6–2× latency |
| GPU on socket 0 → local DIMM (via PCIe + DMA) | ~28 GB/s (PCIe 4.0 x16; PCIe 5.0 x16 roughly doubles this) | 1× |
| GPU on socket 0 → DIMM on socket 1 | ~14–20 GB/s | 1.5–2× latency |
For inference, NUMA penalty is usually invisible — once the model is in VRAM, system RAM traffic is a trickle. NUMA matters when:
- Loading a model. A 100 GB load from the wrong node takes noticeably longer. Bind with numactl or set affinity in your container runtime.
- CPU-side preprocessing (tokenization at scale, image decode, audio resampling). A busy tokenizer on socket 0 with GPUs hanging off socket 1 loses 20–40% throughput.
- Training with CPU-offloaded optimizer state (DeepSpeed ZeRO-Offload). NUMA-foreign state doubles step time. Pin everything.
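Where numactl is not convenient, for example inside a Python preprocessing worker, CPU pinning can be done from the process itself. The sketch below is Linux-only and assumes the standard sysfs layout for NUMA nodes; the helper names are illustrative. It pins CPUs only: memory then follows the kernel's default local-allocation policy for new allocations, which is weaker than an explicit numactl --membind.

```python
import os

def parse_cpulist(text):
    """Parse the kernel's cpulist format, e.g. '0-3,8-11' -> {0, 1, 2, 3, 8, 9, 10, 11}."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus

def pin_to_node(node):
    """Pin the current process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus

# Usage, on the socket that hosts the GPUs this worker feeds:
#   pin_to_node(0)
```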
Practical answer: default to single-socket for inference servers unless you have a specific reason to go dual. Dual-socket exists in our lineup (K-AI 256 Turin Dual) because some workloads — concurrent training plus inference, large in-memory vector stores, eight GPUs needing two root complexes — actually need it. Most do not. Single-socket Turin with 12 channels and 384–512 GB handles most inference use cases.
DDR5 RDIMM vs LRDIMM, and ECC
Server RAM in 2026 is uniformly DDR5 ECC. The choice is RDIMM vs LRDIMM:
- RDIMM (Registered): standard server memory, buffered command path, ECC included. Clean up to 64 GB modules, 128 GB on some platforms.
- LRDIMM (Load-Reduced): adds a memory buffer that reduces bus load, enabling higher per-channel capacity. Required for 128 GB+ modules. Slightly higher latency, marginal in real workloads.
Kentino default: 32 GB or 64 GB RDIMMs at DDR5-4800 (Genoa) or DDR5-6000 (Turin). LRDIMM only when the build requires 1 TB+, rare outside training or multi-model hosting. ECC is non-negotiable — non-ECC server DIMMs do not exist in the platforms we ship.
What breaks when memory is wrong
Predictable failure modes, in rough order of frequency:
- Slow model load on under-spec'd RAM. A 70B model at INT4 is ~40 GB on disk. With 32 GB of system RAM, loading thrashes the page cache and a 40-second cold-start becomes 4 minutes. Fix: 1.5× total VRAM minimum.
- Half-bandwidth penalty from under-populated DIMM channels. Six DIMMs in a twelve-channel EPYC. CPU-side preprocessing throughput can halve silently. Fix: populate all channels.
- NUMA-foreign access on dual-socket with mismatched affinity. Fix: numactl --cpunodebind=0 --membind=0, or the framework's NUMA-aware mode.
- OOM at high batch on under-estimated KV cache. vLLM's --gpu-memory-utilization 0.9 leaves 10% headroom, but 64-concurrent at 32K context still overflows a 24 GB card. Fix: shorter context, smaller batch, or more VRAM.
- CPU offload "saves" the build and destroys throughput. "The server is slow": it turns out 30% of layers sit on CPU because VRAM was tight. Sizing error, not tuning. Buy the right GPU count up front.
None of these are exotic. All show up in the first month of a new install.
When to pay attention
For inference-only deployments:
- Which models do you need to host concurrently? Sum their INT4 footprints. Add 40–60% for KV cache at target batch and context. That is your minimum VRAM.
- What is your latency target per token? Largest model footprint divided by per-card bandwidth tells you whether you need one fast card, four medium cards, or eight smaller cards.
- System RAM minimum: 1.5× total VRAM, populated across all memory channels. Round up to the next standard config.
- Single or dual socket? Default single. Go dual only when you need eight GPUs on two PCIe root complexes, or you mix large training with inference.
- ECC? Yes if training is a real part of the workload or compliance demands it. Skip on pure inference if the budget is tight.
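The first step of the checklist is easy to script. This is a minimal sketch under stated assumptions: the model mix is hypothetical (a 70B model at ~40 GB plus an 8 GB model), the helper names are illustrative, and the card count is naive in that it ignores per-card overhead and how layers shard across cards.

```python
import math

def min_vram_gb(int4_footprints_gb, kv_margin=(0.40, 0.60)):
    """Sum the INT4 footprints, then add 40-60% for KV cache; returns (low, high)."""
    base = sum(int4_footprints_gb)
    return tuple(base * (1 + margin) for margin in kv_margin)

def cards_needed(vram_gb, per_card_gb):
    """Naive card count: total VRAM divided by per-card VRAM, rounded up."""
    return math.ceil(vram_gb / per_card_gb)

# Hypothetical mix: one 70B model (~40 GB at INT4) plus one 8 GB model.
low, high = min_vram_gb([40, 8])
print(f"minimum VRAM: {low:.0f}-{high:.0f} GB")

for name, per_card in [("RTX 5090", 32), ("RTX Pro 6000 Blackwell", 96)]:
    print(f"{name}: {cards_needed(high, per_card)} card(s)")
```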
For training-capable builds, the RAM rule shifts to 2–3× total VRAM — DeepSpeed, Megatron, and similar frameworks lean on system RAM during step execution. NUMA discipline becomes non-optional.
Follow-up articles cover the rest of the stack: PCIe topology and lane assignment (W02), GPU risers and their failure modes (W03), PSU sizing and dual-PSU reality (W04), and thermal envelope design (W05). Memory is the first lever to get right because it sits between every other component — wrong memory makes everything else look broken.
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.