Kentino s.r.o.

K-AI 192 RomeDual 4090 5288TOPS — 8× RTX 4090 — Dual EPYC Milan

Price: 32280.00 EUR
Availability: InStock

Taxes included. Shipping is calculated at checkout.

K-AI 192 RomeDual 4090 5288TOPS

192 GB VRAM 8-GPU Inference Server
8x RTX 4090 | Dual EPYC Milan | 5 288 TOPS INT8

5 288 INT8 TOPS | 192 GB VRAM pool | 8-GPU tensor parallel | Dual CPU, 96C/192T

Flagship 8x gaming-GPU inference box. 192 GB pool at consumer-card economics on a dual-socket EPYC Milan platform.

A 7U 8-GPU chassis built around dual EPYC 7643 Milan CPUs (96C/192T total), an ASRock Rack ROME2D32GM-NL dual-SP3 motherboard, 512 GB DDR4 ECC, a 2 TB NVMe boot drive, and a 5x 1200 W server PSU set. Eight GeForce RTX 4090 cards connect via active PCIe Gen4 retimer risers at full x16. It is the cheapest path to 192 GB frontier MoE inference on Kentino hardware.

Hardware

Component	Detail
GPUs	8x NVIDIA GeForce RTX 4090 24 GB GDDR6X (Ada Lovelace, 450 W, PCIe 4.0 x16)
VRAM pool	192 GB total across 8 cards (no NVLink on consumer RTX 4090)
CPU	2x AMD EPYC 7643 Milan (48C/96T each — 96C/192T total, 225 W each, 2x 128 PCIe 4.0 lanes)
Motherboard	ASRock Rack ROME2D32GM-NL (dual SP3, PCIe 4.0, 32x DDR4 ECC DIMM slots)
System RAM	512 GB DDR4-2666 ECC RDIMM (8x 64 GB — 4 per socket for 8-channel balance)
Boot / storage	2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply	5x 1200 W server PSU set (HP-compatible, hot-swap) + full 12VHPWR adapter set
Chassis	7U 8-GPU chassis (up to 10 PCIe cards including risers)
Risers	8x active PCIe Gen4 x16 retimer risers (required over cable length)
Cooling	2x Arctic Freezer 4U-M SP3 tower coolers + rack-mount front-to-back airflow (industrial fans)
Network	Onboard dual 10 GbE (Intel X550)

Power envelope

GPU draw: 8 x 450 W = 3 600 W
CPU draw: 2 x 225 W = 450 W
System total at full load: ~4 200 W
PSU total: 6 000 W all-active (5x 1200 W) — 30.0 % headroom
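
The headroom figure follows from simple arithmetic on the listed wattages. A minimal Python sketch of that calculation is given here. The constant and function names are illustrative, and the single-PSU-failure check is an added assumption, not a Kentino specification.

```python
# Power budget sketch built only from the wattages listed above.
GPU_COUNT, GPU_WATTS = 8, 450        # RTX 4090 board power
CPU_COUNT, CPU_WATTS = 2, 225        # EPYC 7643 TDP
SYSTEM_TOTAL_WATTS = 4200            # listed full-load estimate (approximate)
PSU_COUNT, PSU_WATTS = 5, 1200


def power_budget() -> dict:
    """Return the itemised draw, PSU capacity, and headroom fraction."""
    gpu = GPU_COUNT * GPU_WATTS
    cpu = CPU_COUNT * CPU_WATTS
    psu_total = PSU_COUNT * PSU_WATTS
    return {
        "gpu_w": gpu,
        "cpu_w": cpu,
        # Remainder of the listed total; the listing does not itemise it.
        "other_w": SYSTEM_TOTAL_WATTS - gpu - cpu,
        "psu_total_w": psu_total,
        "headroom": (psu_total - SYSTEM_TOTAL_WATTS) / psu_total,
        # Assumption: if one PSU drops out, the other four share the load.
        "survives_one_psu_loss": (PSU_COUNT - 1) * PSU_WATTS >= SYSTEM_TOTAL_WATTS,
    }


if __name__ == "__main__":
    for key, value in power_budget().items():
        print(f"{key}: {value}")
```

With the listed numbers, the headroom comes out at 0.30, matching the 30.0 % stated above.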

Lane topology

ROME2D32GM-NL exposes 2x 128 PCIe Gen4 lanes — one 128-lane pool per EPYC socket — direct to GPU slots. Active Gen4 retimer risers for signal integrity. No PCIe switch. No NVLink. Measured 19-22 GB/s inter-GPU peer-to-peer on 4-GPU bench.
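
To put the measured 19-22 GB/s in context, it can be compared with the theoretical per-direction rate of a PCIe 4.0 x16 link (16 GT/s per lane with 128b/130b encoding). The sketch below does that comparison. The function names are illustrative, and the theoretical figure ignores packet and protocol overhead, so real transfers will always land below it.

```python
# Theoretical PCIe 4.0 x16 bandwidth versus the measured peer-to-peer range.
RAW_GT_PER_S = 16.0        # PCIe 4.0 signalling rate per lane (GT/s)
ENCODING = 128 / 130       # 128b/130b line-encoding efficiency
LANES_PER_GPU = 16
GPU_COUNT = 8


def link_gbytes_per_s(lanes: int = LANES_PER_GPU) -> float:
    """Per-direction bandwidth in GB/s before protocol overhead."""
    return RAW_GT_PER_S * ENCODING * lanes / 8  # 8 bits per byte


def link_efficiency(measured_gbytes_per_s: float) -> float:
    """Fraction of the theoretical x16 rate that a measurement reaches."""
    return measured_gbytes_per_s / link_gbytes_per_s()


if __name__ == "__main__":
    print(f"x16 theoretical: {link_gbytes_per_s():.1f} GB/s per direction")
    print(f"GPU lanes needed: {GPU_COUNT * LANES_PER_GPU}")
    for measured in (19.0, 22.0):
        print(f"{measured:.0f} GB/s -> {link_efficiency(measured):.0%} of theoretical")
```

Under these assumptions, the measured range corresponds to roughly 60-70 % of the theoretical x16 rate, and eight cards at x16 need 128 lanes in total.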

What you can run

With 192 GB across 8 cards, this server handles 200B+ frontier MoE at Q4, 8-way tensor-parallel inference, tenant-isolated multi-model serving, and high-batch throughput at consumer-card economics.
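
A rough way to check whether a model's weights fit the pool is parameters times bits per weight, divided by eight. The sketch below applies that rule of thumb. The 4.5 bits per weight used for "Q4" is an assumption (effective bit width varies by quant format), and the estimate covers weights only, not KV cache, activations, or per-GPU runtime overhead.

```python
# Weights-only VRAM estimate against the 8 x 24 GB pool.
POOL_GB = 8 * 24


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_pool(params_billion: float, bits_per_weight: float,
              reserve_gb: float = 0.0) -> bool:
    """True if weights plus a caller-chosen reserve (KV cache etc.) fit the pool."""
    return weight_gb(params_billion, bits_per_weight) + reserve_gb <= POOL_GB


if __name__ == "__main__":
    # Assumed effective bit widths: ~4.5 for Q4, 16 for bf16.
    for name, params, bits in [("235B MoE, Q4", 235, 4.5),
                               ("70B dense, bf16", 70, 16)]:
        gb = weight_gb(params, bits)
        print(f"{name}: ~{gb:.0f} GB weights, {POOL_GB - gb:.0f} GB left in pool")
```

With these assumed bit widths, a 235B model at Q4 lands at about 132 GB, in line with the Qwen3-235B-A22B figure quoted in the model list.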

LLMs — text / reasoning / coding

Chinese frontier

Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q4 (~132 GB) with long ctx — the hero config (~15-25 tok/s single-stream on 8x RTX 4090); Qwen3-Coder-480B-A35B Q2 (~160 GB); Qwen3.5-122B-A10B fp8 (~75 GB) multi-stream; Qwen3-32B dense bf16 x multiple concurrent
DeepSeek: DeepSeek-V3/R1 Q2 (~215 GB with 512 GB host spill); DeepSeek-R2 32B bf16 — up to 8 concurrent streams one per card (~30-40 tok/s per stream)
GLM / Z.ai: GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB); GLM-4.5-Air fp8 or bf16; GLM-4.6V 106B
Tencent Hunyuan: Hunyuan-Large Q3 (~160 GB); Hunyuan-A13B Q4/Q6 (RTX 4090 is Ada — fp8 upcasts to bf16, use GGUF quants)
Others: Baidu ERNIE-4.5-424B Q3 (~180 GB); InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3.5-397B Q3 (~170 GB); MiniMax-M1 Q3 (~180 GB)

Western frontier

Meta Llama: Llama 3.3 70B bf16 with massive KV (~20 tok/s single-stream Q4, ~179 tok/s batch-32 vLLM, both Kentino-measured on the 4-GPU bench); Llama 4 Scout bf16 (~218 GB, exceeds the 192 GB pool, needs host spill); Llama 4 Maverick Q3 (~188 GB)
Mistral: Mistral Large 2 / Pixtral Large 123B Q6 comfortable or bf16 (~248 GB spill); Mistral Small 3 multi-stream
OpenAI (open weights): gpt-oss-120b MXFP4 native (80 GB) with huge KV
NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q4 (~147 GB); Super 49B bf16
Others: Cohere Command R+ 104B Q6 (~85 GB); Google Gemma 3 27B bf16 x multiple streams

Vision-Language Models

InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B bf16 multi-stream; Llama 3.2 90B Vision bf16 (~180 GB); Pixtral Large 124B Q6; Molmo 72B bf16; GLM-4.6V 106B fp8/Q6; Gemma 3 27B multimodal x multiple streams.

Image generation

FLUX.1 [dev] bf16 — up to 8 concurrent generation streams (one per card, ~15-25 s/image at fp8); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large bf16 x 8; HunyuanImage-2.1 bf16 (~34 GB) x 2-4 concurrent; HunyuanImage-3.0 base (80B MoE, 13B active) bf16; HunyuanDiT; Kolors / Kolors 2.0; AuraFlow; OmniGen v1; PixArt-Sigma.

Video generation

Wan 2.2 MoE dual-expert bf16 with full ctx — multiple concurrent streams; Wan 2.2 TI2V-5B x 8 concurrent; HunyuanVideo 13B bf16 both experts; HunyuanVideo 1.5; CogVideoX-5B bf16; Open-Sora 2.0 11B bf16; Genmo Mochi-1 bf16; LTX-Video x 8 concurrent; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.

Audio / Speech / TTS

ASR: Whisper v3 large / turbo x 8 concurrent (~50x realtime per stream); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
TTS: CosyVoice 2/3; Kokoro 82M; XTTS v2; Stable Audio Open
Realtime / S2S: Kyutai Moshi 7B x 8 concurrent voice streams; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

8-way tensor-parallel inference of 200-250B MoE at Q4 (Qwen3-235B, GLM-4.5/4.6/4.7)
Tenant-isolated 8-stream serving: one Q4 model per 24 GB card (e.g. 8x Qwen3-14B agents)
Large-batch 70B — tensor-parallel vLLM / SGLang batch-64 aggregate
Mixed fleet: 235B MoE on 4 cards (TP4) + FLUX + video + realtime voice on remaining 4
Fine-tuning lab — 7-34B LoRA / QLoRA with large batch
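
One common way to split the eight cards between services is to start each process with its own CUDA_VISIBLE_DEVICES value. The sketch below generates such a plan. The service names and the 4/1/2/1 split are hypothetical, and it assumes device indices 0-7 map onto the cards in a stable order, which should be verified on the actual build.

```python
# Partition 8 GPUs between services via CUDA_VISIBLE_DEVICES strings.
def assign_gpus(services: dict, total_gpus: int = 8) -> dict:
    """Map each service name to a comma-separated list of contiguous GPU indices."""
    plan = {}
    next_gpu = 0
    for name, count in services.items():
        if count < 1:
            raise ValueError(f"{name}: needs at least one GPU")
        if next_gpu + count > total_gpus:
            raise ValueError(f"{name}: plan exceeds {total_gpus} GPUs")
        plan[name] = ",".join(str(i) for i in range(next_gpu, next_gpu + count))
        next_gpu += count
    return plan


if __name__ == "__main__":
    # Hypothetical mixed fleet: one 4-way tensor-parallel LLM plus three smaller services.
    mixed = assign_gpus({"llm-tp4": 4, "image": 1, "video": 2, "voice": 1})
    for name, devices in mixed.items():
        print(f"CUDA_VISIBLE_DEVICES={devices}  # {name}")
```

The same helper covers the tenant-isolated case by passing eight single-GPU services, so each process sees exactly one card.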

Target workloads

8-GPU tensor-parallel inference at the 192 GB pool — Qwen3-235B Q4, GLM-4.5/4.6/4.7 Q4, Llama 4 Scout bf16
Dense 70B bf16 (Llama 3.3 70B) with massive KV headroom for long ctx and high batch
High-throughput batch inference gateway — vLLM / SGLang tensor-parallel at large batch
Fine-tuning of 7-34B class models with high-batch LoRA / QLoRA
Wan 2.2 dual-expert / HunyuanImage-3.0 / FLUX.1 full workflow video-image studio

Measured performance

Kentino bench (4-GPU reference) | 2026-04-10 | 4x RTX 4090 + EPYC 7542 + 512 GB DDR4 + ROMED8-2T

Benchmark	Result
Sustained compute (fp16, 4-card ref)	647 TFLOPS
vLLM — Llama 3.3 70B AWQ INT4 (single)	8.0 tok/s
vLLM — Llama 3.3 70B AWQ INT4 (batch-32)	179 tok/s aggregate
llama.cpp — Llama 3.3 70B Q4_K_M (single)	20.3 tok/s decode
8-GPU aggregate compute (extrapolation)	~1 294 TFLOPS fp16 expected (near-linear)
235B Q4 tensor-parallel 8-way (community)	15-25 tok/s single-stream on 8x RTX 4090

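4-card data was measured on Kentino hardware. The 8-GPU compute figure is a linear extrapolation from the 4-card result, and the 235B tensor-parallel figure is a community-reported external reference. Kentino will publish first-party 8-GPU numbers after the first customer build.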
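
The derived numbers in the table can be reproduced from the measured ones. The sketch below shows the arithmetic; the function names are illustrative, and linear scaling from 4 to 8 GPUs is the table's own assumption, not a measured result.

```python
# Reproduce the derived figures from the measured 4-GPU results.
FOUR_CARD_TFLOPS = 647.0     # sustained fp16, 4-card reference
VLLM_SINGLE_TOK_S = 8.0      # Llama 3.3 70B AWQ INT4, single stream
VLLM_BATCH32_TOK_S = 179.0   # same model, batch-32 aggregate


def linear_extrapolation(measured: float, from_gpus: int, to_gpus: int) -> float:
    """Scale a measured figure assuming perfectly linear GPU scaling."""
    return measured * to_gpus / from_gpus


def per_stream(aggregate_tok_s: float, batch: int) -> float:
    """Average throughput each stream sees inside a batch."""
    return aggregate_tok_s / batch


def batching_gain(aggregate_tok_s: float, single_tok_s: float) -> float:
    """Aggregate throughput relative to a single stream."""
    return aggregate_tok_s / single_tok_s


if __name__ == "__main__":
    print(f"8-GPU fp16 (linear): {linear_extrapolation(FOUR_CARD_TFLOPS, 4, 8):.0f} TFLOPS")
    print(f"per-stream at batch-32: {per_stream(VLLM_BATCH32_TOK_S, 32):.2f} tok/s")
    print(f"batching gain: {batching_gain(VLLM_BATCH32_TOK_S, VLLM_SINGLE_TOK_S):.1f}x")
```

Doubling the 4-card 647 TFLOPS gives the 1 294 TFLOPS in the table. The batch-32 result works out to about 5.6 tok/s per stream, roughly 22x the single-stream aggregate.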

Not ideal for

5090-generation workloads (Blackwell fp8 native + higher TOPS) — see K-AI 256 TurinDual 5090
Training from scratch (no NVLink on consumer RTX 4090)
ECC-sensitive 24/7 production — consumer RTX 4090 has no ECC; prefer 4x L40 or 2x RTX Pro 6000 Server Edition
Hunyuan / DeepSeek fp8 native — RTX 4090 is Ada, fp8 checkpoints upcast to bf16

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

Build includes assembly, BIOS config with dual-socket NUMA tuning, driver install, burn-in, memtest, full 8-GPU stress test, and LLM environment setup. Lead time depends on component availability, confirmed at order.

Recommended add-ons

4 TB additional NVMe for weight staging and MoE offload workloads
NVIDIA ConnectX-5 100 GbE for multi-node serving
RAM upgrade to 1 TB (16x 64 GB) or 2 TB (32x 64 GB) — board supports 32 DIMM slots
Full 24U rack cabinet + online UPS 5 kVA