Kentino s.r.o.
K-AI 192 RomeDual 4090 5288TOPS — 8× RTX 4090 — Dual EPYC Milan
K-AI 192 RomeDual 4090 5288TOPS — 8× RTX 4090 — Dual EPYC Milan
Kan beschikbaarheid voor afhalen niet laden
K-AI 192 RomeDual 4090 5288TOPS
192 GB VRAM 8-GPU Inference Server
8x RTX 4090 | Dual EPYC Milan | 5 288 TOPS INT8
Flagship 8x gaming-GPU inference box. 192 GB pool at consumer-card economics on a dual-socket EPYC Milan platform.
A 7U 8-GPU chassis built around dual EPYC 7643 Milan CPUs (96C/192T total), ASRock Rack ROME2D32GM-NL dual-SP3 motherboard, 512 GB DDR4 ECC, 2 TB NVMe boot, and a 5x 1200 W server PSU set. Eight GeForce RTX 4090 connect via active PCIe Gen4 retimer risers at full x16. The cheapest path to 192 GB frontier MoE inference on Kentino hardware.
Hardware
| Component | Detail |
|---|---|
| GPUs | 8x NVIDIA GeForce RTX 4090 24 GB GDDR6X (Ada Lovelace, 450 W, PCIe 4.0 x16) |
| VRAM pool | 192 GB total across 8 cards (no NVLink on consumer RTX 4090) |
| CPU | 2x AMD EPYC 7643 Milan (48C/96T each — 96C/192T total, 225 W each, 2x 128 PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROME2D32GM-NL (dual SP3, PCIe 4.0, 32x DDR4 ECC DIMM slots) |
| System RAM | 512 GB DDR4-2666 ECC RDIMM (8x 64 GB — 4 per socket for 8-channel balance) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 5x 1200 W server PSU set (HP-compatible, hot-swap) + full 12VHPWR adapter set |
| Chassis | 7U 8-GPU chassis (up to 10 PCIe cards including risers) |
| Risers | 8x active PCIe Gen4 x16 retimer risers (required over cable length) |
| Cooling | 2x Arctic Freezer 4U-M SP3 tower coolers + rack-mount front-to-back airflow (industrial fans) |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 8 x 450 W = 3 600 W
- CPU draw: 2 x 225 W = 450 W
- System total at full load: ~4 200 W
- PSU total: 6 000 W all-active (5x 1200 W) — 30.0 % headroom
Lane topology
ROME2D32GM-NL exposes 2x 128 PCIe Gen4 lanes — one 128-lane pool per EPYC socket — direct to GPU slots. Active Gen4 retimer risers for signal integrity. No PCIe switch. No NVLink. Measured 19-22 GB/s inter-GPU peer-to-peer on 4-GPU bench.
What you can run
With 192 GB across 8 cards, this server handles 200B+ frontier MoE at Q4, 8-way tensor-parallel inference, tenant-isolated multi-model serving, and high-batch throughput at consumer-card economics.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q4 (~132 GB) with long ctx — the hero config (~15-25 tok/s single-stream on 8x RTX 4090); Qwen3-Coder-480B-A35B Q2 (~160 GB); Qwen3.5-122B-A10B fp8 (~75 GB) multi-stream; Qwen3-32B dense bf16 x multiple concurrent
- DeepSeek: DeepSeek-V3/R1 Q2 (~215 GB with 512 GB host spill); DeepSeek-R2 32B bf16 — up to 8 concurrent streams one per card (~30-40 tok/s per stream)
- GLM / Z.ai: GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB); GLM-4.5-Air fp8 or bf16; GLM-4.6V 106B
- Tencent Hunyuan: Hunyuan-Large Q3 (~160 GB); Hunyuan-A13B Q4/Q6 (RTX 4090 is Ada — fp8 upcasts to bf16, use GGUF quants)
- Others: Baidu ERNIE-4.5-424B Q3 (~180 GB); InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3.5-397B Q3 (~170 GB); MiniMax-M1 Q3 (~180 GB)
Western frontier
- Meta Llama: Llama 3.3 70B bf16 with massive KV (~20 tok/s single-stream Q4, ~179 tok/s batch-32 vLLM — Kentino measured on 4-GPU bench); Llama 4 Scout bf16 (~218 GB tight); Llama 4 Maverick Q3 (~188 GB)
- Mistral: Mistral Large 2 / Pixtral Large 123B Q6 comfortable or bf16 (~248 GB spill); Mistral Small 3 multi-stream
- OpenAI (open weights): gpt-oss-120b MXFP4 native (80 GB) with huge KV
- NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q4 (~147 GB); Super 49B bf16
- Others: Cohere Command R+ 104B Q6 (~85 GB); Google Gemma 3 27B bf16 x multiple streams
Vision-Language Models
InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B bf16 multi-stream; Llama 3.2 90B Vision bf16 (~180 GB); Pixtral Large 124B Q6; Molmo 72B bf16; GLM-4.6V 106B fp8/Q6; Gemma 3 27B multimodal x multiple streams.
Image generation
FLUX.1 [dev] bf16 — up to 8 concurrent generation streams (one per card, ~15-25 s/image at fp8); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large bf16 x 8; HunyuanImage-2.1 bf16 (~34 GB) x 2-4 concurrent; HunyuanImage-3.0 base (80B MoE, 13B active) bf16; HunyuanDiT; Kolors / Kolors 2.0; AuraFlow; OmniGen v1; PixArt-Sigma.
Video generation
Wan 2.2 MoE dual-expert bf16 with full ctx — multiple concurrent streams; Wan 2.2 TI2V-5B x 8 concurrent; HunyuanVideo 13B bf16 both experts; HunyuanVideo 1.5; CogVideoX-5B bf16; Open-Sora 2.0 11B bf16; Genmo Mochi-1 bf16; LTX-Video x 8 concurrent; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo x 8 concurrent (~50x realtime per stream); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2/3; Kokoro 82M; XTTS v2; Stable Audio Open
- Realtime / S2S: Kyutai Moshi 7B x 8 concurrent voice streams; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
- Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2
Multi-model / multi-tenant serving
- 8-way tensor-parallel inference of 200-250B MoE at Q4 (Qwen3-235B, GLM-4.5/4.6/4.7)
- Tenant-isolated 8-stream serving — one 24 GB Q4 model per card (e.g. 8x Qwen3-14B agents)
- Large-batch 70B — tensor-parallel vLLM / SGLang batch-64 aggregate
- Mixed fleet: 235B MoE on 4 cards (TP4) + FLUX + video + realtime voice on remaining 4
- Fine-tuning lab — 7-34B LoRA / QLoRA with large batch
Target workloads
- 8-GPU tensor-parallel inference at the 192 GB pool — Qwen3-235B Q4, GLM-4.5/4.6/4.7 Q4, Llama 4 Scout bf16
- Dense 70B bf16 (Llama 3.3 70B) with massive KV headroom for long ctx and high batch
- High-throughput batch inference gateway — vLLM / SGLang tensor-parallel at large batch
- Fine-tuning of 7-34B class models with high-batch LoRA / QLoRA
- Wan 2.2 dual-expert / HunyuanImage-3.0 / FLUX.1 full workflow video-image studio
Measured performance
Kentino bench (4-GPU reference) | 2026-04-10 | 4x RTX 4090 + EPYC 7542 + 512 GB DDR4 + ROMED8-2T
| Benchmark | Result |
|---|---|
| Sustained compute (fp16, 4-card ref) | 647 TFLOPS |
| vLLM — Llama 3.3 70B AWQ INT4 (single) | 8.0 tok/s |
| vLLM — Llama 3.3 70B AWQ INT4 (batch-32) | 179 tok/s aggregate |
| llama.cpp — Llama 3.3 70B Q4_K_M (single) | 20.3 tok/s decode |
| 8-GPU aggregate compute (extrapolation) | ~1 294 TFLOPS fp16 expected (near-linear) |
| 235B Q4 tensor-parallel 8-way (community) | 15-25 tok/s single-stream on 8x RTX 4090 |
4-card data measured on Kentino hardware. 8-GPU extrapolation is published external reference. Kentino will publish first-party 8-GPU numbers after the first customer build.
Not ideal for
- 5090-generation workloads (Blackwell fp8 native + higher TOPS) — see K-AI 256 TurinDual 5090
- Training from scratch (no NVLink on consumer RTX 4090)
- ECC-sensitive 24/7 production — consumer RTX 4090 has no ECC; prefer 4x L40 or 2x RTX Pro 6000 Server Edition
- Hunyuan / DeepSeek fp8 native — RTX 4090 is Ada, fp8 checkpoints upcast to bf16
Warranty and lead time
Build includes assembly, BIOS config with dual-socket NUMA tuning, driver install, burn-in, memtest, full 8-GPU stress test, and LLM environment setup. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- 4 TB additional NVMe for weight staging and MoE offload workloads
- NVIDIA ConnectX-5 100 GbE for multi-node serving
- RAM upgrade to 1 TB (16x 64 GB) or 2 TB (32x 64 GB) — board supports 32 DIMM slots
- Full 24U rack cabinet + online UPS 5 kVA
Share
