Kentino s.r.o.
K-AI 256 TurinDual 5090 — 8× RTX 5090 Dual-Socket Zen5c Flagship (Request Quote on CPU)
K-AI 256 TurinDual 5090 — 8× RTX 5090 Dual-Socket Zen5c Flagship (Request Quote on CPU)
Couldn't load pickup availability
K-AI 256 TurinDual 5090 13408TOPS
256 GB VRAM Flagship Inference Server
8x RTX 5090 | Dual EPYC Turin | 13 408 TOPS INT8
CPU pricing finalized at order — Turin 9005-series market moves weekly in Q2 2026.
Published external references. Not measured on Kentino hardware.
A 7U rack-mount flagship inference server with eight GeForce RTX 5090 (32 GB GDDR7, Blackwell, fp8 native) on a dual-socket EPYC Turin (Zen5c, SP5) platform with 768 GB DDR5-4800 ECC across all 12 channels, 2 TB NVMe boot, and 5x 1200 W server PSU. End-to-end PCIe Gen5 at the GPU via active retimer/redriver risers. Runs vLLM, SGLang, llama.cpp, ComfyUI and every major open-weight inference stack out of the box.
Hardware
| Component | Detail |
|---|---|
| GPUs | 8x NVIDIA GeForce RTX 5090 32 GB GDDR7 (Blackwell, 575 W TGP, PCIe 5.0 x16, fp8 native, 1676 INT8 TOPS/card) |
| VRAM pool | 256 GB aggregate across 8 cards (no NVLink on consumer RTX 5090) |
| CPU | 2x AMD EPYC Turin 9005-series (Zen5c, SP5, PCIe 5.0) — quote-pending at order |
| Motherboard | ASRock Rack TURIN2D24XGM/500W (dual SP5, PCIe 5.0, 24x DDR5 DIMM) |
| System RAM | 768 GB DDR5-4800 ECC RDIMM (12x 64 GB — all 12 channels populated; 12 slots remain for scale to 1.5 TB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 5x 1200 W server PSU set (HP-compatible, 6 kW aggregate) |
| Chassis | 7U 8-GPU (up to 10 PCIe slots, separate PSU bays) |
| Cooling | 2x SP5 tower coolers + rack-mount front-to-back airflow (industrial fans) |
| Risers | 8x active PCIe Gen5 x16 (retimer/redriver) — end-to-end Gen5 |
| Network | Onboard 10 GbE (board-dependent) |
Power envelope
- GPU draw: 8 x 575 W = 4 600 W
- System total at full load: ~5 520 W
- PSU total: 6 000 W (5x 1200 W) — 8% headroom at spec
- Kentino ships with GPU power-cap at 500 W — total drops to ~4 920 W (~15% headroom)
Lane topology
Dual Turin provides 2x 128 = 256 PCIe Gen5 lanes host-side. Active Gen5 risers carry Gen5 x16 end-to-end at each GPU — no PCIe switch required (one CPU per 4-card bank). No NVLink; inter-GPU P2P at Gen5 x16 (~60 GB/s nominal per link).
What you can run
With 256 GB of pooled VRAM across 8 Blackwell cards with fp8 native, this server targets frontier 235-480B MoE at Q4 with real context, DeepSeek V3 family at Q2, and Kimi-K2 1.58-bit dynamic-quant at real throughput.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3-235B-A22B (Instruct / Thinking / "2507") Q4 (~132 GB) with long context + multi-user batching (~25-40 tok/s single-stream on 8x RTX 5090, published reference)
- GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB) — flagship reasoning/coding, 200k ctx on 4.6+
- GLM-5 / GLM-5.1 Q2 (~260 GB) with minor RAM spill — frontier coding close to Claude Opus 4.6
- DeepSeek V3 / R1 / V3.1 / V3.2 / V3.2-Speciale Q2 (~215 GB) at useful inference speed (~28 tok/s single-stream on 8x Blackwell, published reference)
- Kimi-K2 1.58-bit UD-TQ1_0 (~240 GB) — trillion-parameter agent at real token throughput (~7-10 tok/s single-stream, published reference)
- Hunyuan-Large 389B/52B MoE Q4 (~220 GB); ERNIE-4.5-424B-A47B Q4 (~240 GB)
- Qwen3-Coder-480B-A35B Q4 (~270 GB tight with RAM spill) — SOTA open coding flagship
- MiniMax-M1 / Text-01 Q4 (~260 GB) 1M context; Qwen3.5-397B-A17B Q4 (~214 GB)
Western frontier
- Mistral Large 3 (675B/41B MoE, Apache 2.0) Q3 (~317 GB with spill) — Western frontier open weights
- Llama 4 Maverick (400B/17B, 128 experts) Q4 (~232 GB) multimodal
- Llama-3.1-Nemotron Ultra 253B Q4 (~119 GB) — matches DeepSeek-R1 at half size
- gpt-oss-120b MXFP4 native (80 GB) comfortably with room for multiple models
- Devstral 2 123B (Modified MIT) Q6 — top open coding, 256k ctx
- Llama 3.3 70B bf16 (~142 GB) multi-tenant serving (~30-40 tok/s single-stream per RTX 5090 pair TP2, published reference)
Vision-Language Models
Qwen3-VL-235B-A22B full bf16 (~240 GB on-card); InternVL3.5-241B-A28B (~135 GB Q4); Llama 3.2 90B Vision bf16; Pixtral Large 124B bf16 (~248 GB tight); Qwen3-Omni-30B-A3B; Molmo 72B; ERNIE-4.5-VL; GLM-4.6V full. Blackwell fp8 path gives ~2x throughput on vision-tower inference vs Ada.
Image generation
FLUX.1 [dev] / Kontext / Tools full bf16 (~10-18 s/image at fp8 per card, published reference); SD 3.5 Large; HunyuanImage-2.1 (17B, native 2K); HunyuanImage-3.0 80B/13B MoE; AuraFlow; OmniGen; multi-worker ComfyUI farms.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B dual expert bf16 (both high-noise + low-noise resident simultaneously); HunyuanVideo 13B bf16 both experts; Open-Sora 2.0 (11B) bf16; CogVideoX-5B; Mochi-1; LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2 / 3; Kokoro; Stable Audio Open; XTTS v2; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi; Step-Audio 2 mini / R1; Qwen2.5-Omni-7B
- Music / SFX: MusicGen; AudioGen; Bark; SeamlessM4T v2
Multi-model / multi-tenant serving
- Frontier-inference gateway — 200B+ MoE + concurrent 70B + image + video all resident
- 8-way tensor-parallel for Kimi-K2 / DeepSeek V3 at real context
- Multi-tenant LLM API — 50-100 concurrent users on 235B Q4 via vLLM/SGLang
- Full Chinese + Western frontier residency concurrently for evaluation / benchmarking
Target workloads
- Frontier open-weight inference backend for a 100-500 seat org, mixing Qwen3-235B, GLM-4.5+, and DeepSeek V3 Q2
- Kimi-K2 1.58-bit agent platform at production throughput (tool-use, 200+ sequential calls)
- Full-fp8 DeepSeek V3 / R1 serving on Blackwell silicon
- Multi-node training head with Gen5 100 GbE / InfiniBand fabric
- Dual-role inference + diffusion farm (Qwen3-235B + FLUX.1 + HunyuanVideo 13B concurrently)
Published performance references
External references | Not measured on Kentino hardware
| Benchmark | Result |
|---|---|
| RTX 5090 per-card INT8 TOPS | 1 676 TOPS |
| RTX 5090 memory bandwidth | ~1 800 GB/s per card |
| vLLM — Qwen3-235B Q4_K_M on 4x RTX 5090 (single) | ~90 tok/s |
| vLLM — Qwen3-235B Q4_K_M on 4x RTX 5090 (batch-32) | ~450 tok/s aggregate |
| SGLang — DeepSeek V3 Q2 on 8x Blackwell (single) | ~28 tok/s |
| llama.cpp — Kimi-K2 UD-TQ1_0 on 8x Blackwell 256 GB | ~7-10 tok/s |
Kentino will publish first-party tok/s after the first customer build with final Turin SKU.
Not ideal for
- Budget-conscious deployments (Turin premium vs Genoa or Rome alternatives)
- Single-tenant 70B dense workloads (overkill — 4x RTX 5090 or 4x RTX Pro 6000 is the right tier)
- Frontier 600B+ at Q4+ full context (require 576 GB+ pool — see 6x RTX Pro 6000)
- Sustained training from scratch (no NVLink on consumer RTX 5090)
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- Scale RAM to 1.5 TB DDR5 (24x 64 GB full population) — required for Kimi-K2 Q4 or DeepSeek V3 Q3 without RAM spill
- NVIDIA ConnectX-5 100 GbE MCX555A-ECAT — Gen5 fabric for cluster nodes
- Mellanox ConnectX-6 25 GbE SFP28 for datacenter fabric
- 4 TB NVMe Gen4 x4 for boot + model library
- Full 24U rack cabinet with managed PDU
- Online UPS 8-10 kVA (critical — 5.5 kW peak draw)
Share
