Kentino s.r.o.

K-AI 128 Rome 5090 6704TOPS — 4× RTX 5090 Blackwell AI Server

Name: K-AI 128 Rome 5090 6704TOPS — 4× RTX 5090 Blackwell AI Server
Brand: Kentino s.r.o.
Price: 25372.00 EUR
Availability: InStock

€25.372,00 EUR

In offerta Esaurito

Imposte incluse. Spese di spedizione calcolate al check-out.

Quantità

K-AI 128 Rome 5090 6704TOPS

128 GB VRAM Blackwell Inference Server
4x RTX 5090 | EPYC Milan | 6 704 TOPS INT8

6 704

INT8 TOPS

128 GB

VRAM pool

Blackwell

fp8 native

2.5x

vs 4090 TOPS

Four Blackwell RTX 5090 with native fp8/fp4 tensor paths. Highest-throughput 4-GPU build on the Rome platform.

A 4U rack-mount inference server with four GeForce RTX 5090 pooled to 128 GB VRAM, one AMD EPYC 7643 Milan CPU (48C/96T), 512 GB DDR4 ECC (all 8 DIMM slots populated for max bandwidth), 2 TB NVMe boot, and dual synchronized 2 kW ATX PSU. Runs vLLM, SGLang, llama.cpp, ComfyUI with Blackwell-native fp8 and MXFP4 inference kernels.

Hardware

Component	Detail
GPUs	4x NVIDIA GeForce RTX 5090 32 GB GDDR7 (Blackwell, 575 W, PCIe 5.0 x16)
VRAM pool	128 GB total across 4 cards (no NVLink on consumer 5090)
CPU	AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard	ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM	512 GB DDR4-2666 ECC RDIMM (8x 64 GB — all DIMM slots populated)
Boot / storage	2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply	Dual 2 kW ATX PSU with sync cable + 12VHPWR adapter set
Chassis	4U rack-mount, 4x GPU, passive PCIe 4.0 x16 risers
Cooling	Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust
Network	Onboard dual 10 GbE (Intel X550)

Power envelope

GPU draw: 4 x 575 W = 2 300 W
System total at full load: ~2 650 W
PSU total: 4 000 W (dual 2 kW synced) — 33.8 % headroom
Dual PSU for split power delivery — each PSU feeds a portion of the system

Lane topology

ROMED8-2T fans out 128 PCIe Gen4 lanes from the EPYC directly to seven x16 slots; four populated by GPUs at Gen4 x16. No PCIe switch. No NVLink on consumer 5090 — inter-GPU peer-to-peer. Cards are Gen5 native; Rome caps at Gen4.

What you can run

With 128 GB pooled VRAM and Blackwell-native fp8 tensor paths, this server steps up to Qwen3-235B-A22B Q4 and gpt-oss-120b MXFP4 with real KV headroom — beyond what 4x RTX 4090 can reach.

LLMs — text / reasoning / coding

Chinese frontier

Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q3-Q4 (~112-132 GB) fits the 128 GB pool with 8-16k ctx — the hero config; Qwen3-32B dense bf16 (~65 GB) with massive KV; Qwen3-Coder-30B-A3B agentic at 1M ctx; Qwen3.5-122B-A10B Q6/fp8 (~75-80 GB); QwQ-32B bf16 reasoning
DeepSeek: DeepSeek-V3/R1/V3.1/V3.2 fp8-native Q2 (~215 GB) with RAM spill across 512 GB host — feasible for batch; DeepSeek-R2 32B bf16 multi-stream (4 concurrent, one per card)
GLM / Z.ai: GLM-4.5-Air 106B/12B fp8 (~106 GB) or Q6 comfortably; GLM-4.5/4.6/4.7 Q2_K_XL (~135 GB) tight with MoE offload
Tencent Hunyuan: Hunyuan-A13B fp8 native (~80 GB) — Blackwell runs fp8 without upcast penalty; Hunyuan-Large Q2 with RAM spill
ByteDance Seed-OSS-36B bf16 with 512k native; ERNIE-4.5-424B Q2 (~150 GB spill)

Western frontier

Meta Llama: Llama 3.3 70B Q4 across 4x 5090 (~30-40 tok/s single-stream, ~270+ tok/s batch-32 vLLM); Llama 4 Scout 109B/17B MoE fp8/Q6 (~90 GB); Llama 4 Maverick 400B/17B Q3 (~188 GB spill)
Mistral: Mistral Small 3 / Magistral / Devstral Small 2 (24B) bf16 multi-stream; Pixtral Large / Mistral Large 2 (123B) Q6 (~88 GB)
OpenAI (open weights): gpt-oss-120b MXFP4 native (80 GB) with real KV and long context — Blackwell hero workload; gpt-oss-20b MXFP4
Google Gemma 3: 27B multimodal bf16 (~54 GB) two concurrent streams; 12B / 4B
Microsoft Phi-4 14B dense bf16; Phi-4-reasoning distilled
NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q3 (~119 GB) tight; Super 49B bf16 (~98 GB)
Others: Cohere Command R+ 104B Q6 (~85 GB); Molmo 72B Q6-bf16 VLM; OLMo 2 32B; IBM Granite 4.0 H-Small

Vision-Language Models

Qwen3-VL-235B-A22B Q3-Q4; Qwen3-VL-32B bf16; InternVL3.5-241B-A28B Q4 (~135 GB tight); InternVL3 78B bf16; Llama 3.2 90B Vision Q6 (~74 GB); Pixtral Large 124B Q6 (~88 GB); Molmo 72B Q6/bf16; Gemma 3 27B multimodal bf16; GLM-4.6V 106B fp8.

Image generation

FLUX.1 [dev] bf16 and fp8 (~10-18 s/image at fp8); FLUX.1 Kontext [dev]; SD 3.5 Large bf16; HunyuanImage-2.1 bf16 and Q4; HunyuanImage-3.0 base (80B MoE, 13B active) bf16 (~80 GB, hero footprint); HunyuanDiT; Kolors / Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma.

Video generation

Wan 2.2 MoE two-expert bf16 (~54 GB, full ctx); Wan 2.2 TI2V-5B; HunyuanVideo 13B bf16 both experts (~60-80 GB); HunyuanVideo 1.5; CogVideoX-5B bf16; Open-Sora 2.0 11B bf16 (~24 GB); Genmo Mochi-1 bf16 (~42 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.

Audio / Speech / TTS

ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
TTS: CosyVoice 2 / 3; Kokoro 82M; Stable Audio Open; XTTS v2; StyleTTS 2; Step-Audio-EditX
Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

200B MoE at Q4 with batch inference (Qwen3-235B, GLM-4.5/4.6/4.7-Air) for 8-16 concurrent users
fp8-native frontier — DeepSeek V3 family, Hunyuan-Large fp8 with Blackwell native paths
Mixed resident stack: gpt-oss-120b MXFP4 + FLUX.1 + Whisper + Moshi on partitioned VRAM
High-throughput 70B — tensor-parallel vLLM / SGLang with 200+ tok/s batch aggregate

Target workloads

200B+ MoE production serving at Q3-Q4 with real KV (Qwen3-235B, GLM-4.5-Air 106B)
fp8-native frontier inference (DeepSeek V3/R1 fp8, Hunyuan fp8) — Blackwell runs without upcast
High-throughput 70B serving — tensor-parallel batch via vLLM or SGLang
Video generation studio at bf16 (Wan 2.2 dual-expert, HunyuanVideo 13B, Mochi-1)
Multi-tenant mixed workload — 120B MoE + image gen + realtime voice all resident

Measured performance

Published references | NVIDIA RTX 5090 datasheet + community benchmarks

Benchmark	Result
Per-card INT8 TOPS (NVIDIA datasheet)	1 676 TOPS
Aggregate INT8 TOPS (4 cards)	6 704 TOPS
Memory bandwidth per card	~1 792 GB/s
Llama 3.3 70B Q6 via vLLM (community)	60-90 tok/s single-stream, 300+ tok/s batch
Qwen3-235B-A22B Q3-Q4	Fits 128 GB pool with 8-16k ctx
gpt-oss-120b MXFP4 native	80 GB — comfortable with KV headroom

Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.

Not ideal for

Frontier 400B+ at Q4 (Kimi-K2, Mistral Large 3, Intern-S1-Pro — require 8-GPU or 6x RTX Pro 6000)
PCIe Gen5-link-sensitive workloads — pick the Genoa SKU for native Gen5 x16
Training from scratch (no NVLink on consumer 5090)
ECC-sensitive 24/7 production — consumer 5090 has no ECC; prefer L40 or RTX Pro 6000 Server Edition

Warranty and lead time

2 years

parts warranty

1 year

labor warranty

10-28 days

lead time

Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.

Recommended add-ons

Upgrade PSU to dual 2.5 kW (FSP) for sustained worst-case bf16 + video — recommended for 24/7
4 TB NVMe for model library + MoE weight staging
24U open cabinet for multi-server deployment
Consider the Genoa-platform variant on request for Gen5 x16 link