Kentino s.r.o.

K-AI 256 TurinDual 5090 — 8× RTX 5090 Dual-Socket Zen5c Flagship (Request Quote on CPU)

Regular price €0,00 EUR
Taxes included. Shipping calculated at checkout.

256 GB VRAM Flagship Inference Server
8x RTX 5090 | Dual EPYC Turin | 13 408 TOPS INT8

13 408 TOPS INT8 | 256 GB VRAM pool | fp8 Blackwell native | Gen5 PCIe end-to-end

CPU pricing finalized at order — Turin 9005-series market moves weekly in Q2 2026.

Published external references. Not measured on Kentino hardware.

A 7U rack-mount flagship inference server with eight GeForce RTX 5090 cards (32 GB GDDR7, Blackwell, native fp8) on a dual-socket EPYC Turin (Zen5c, SP5) platform, with 768 GB DDR5-4800 ECC across all 12 channels, a 2 TB NVMe boot drive, and five 1200 W server PSUs. Active retimer/redriver risers carry PCIe Gen5 x16 all the way to each GPU. Runs vLLM, SGLang, llama.cpp, ComfyUI, and other major open-weight inference stacks out of the box.

Hardware

Component | Detail
GPUs | 8x NVIDIA GeForce RTX 5090 32 GB GDDR7 (Blackwell, 575 W TGP, PCIe 5.0 x16, fp8 native, 1676 INT8 TOPS/card)
VRAM pool | 256 GB aggregate across 8 cards (no NVLink on consumer RTX 5090)
CPU | 2x AMD EPYC Turin 9005-series (Zen5c, SP5, PCIe 5.0), quote-pending at order
Motherboard | ASRock Rack TURIN2D24XGM/500W (dual SP5, PCIe 5.0, 24x DDR5 DIMM)
System RAM | 768 GB DDR5-4800 ECC RDIMM (12x 64 GB, all 12 channels populated; 12 slots remain for scaling to 1.5 TB)
Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | 5x 1200 W server PSU set (HP-compatible, 6 kW aggregate)
Chassis | 7U 8-GPU (up to 10 PCIe slots, separate PSU bays)
Cooling | 2x SP5 tower coolers + rack-mount front-to-back airflow (industrial fans)
Risers | 8x active PCIe Gen5 x16 (retimer/redriver), end-to-end Gen5
Network | Onboard 10 GbE (board-dependent)

Power envelope

  • GPU draw: 8 x 575 W = 4 600 W
  • System total at full load: ~5 520 W
  • PSU total: 6 000 W (5x 1200 W), about 8% headroom at spec
  • Kentino ships with the GPUs power-capped at 500 W, which drops the total to ~4 920 W (~18% headroom)
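
The power figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not a Kentino tool: it assumes that the non-GPU load (CPUs, RAM, fans, storage) stays constant when the GPUs are capped, and it defines headroom as unused PSU capacity divided by total PSU rating. All names are illustrative.

```python
# Power-budget sketch using only the figures quoted in the list above.
GPU_COUNT = 8
GPU_TGP_W = 575            # stock RTX 5090 TGP from the hardware table
GPU_CAP_W = 500            # per-GPU power cap applied at shipping
SYSTEM_FULL_LOAD_W = 5520  # quoted full-load system total
PSU_TOTAL_W = 5 * 1200     # five 1200 W supplies

# Assumption: everything that is not GPU draw stays constant under a cap.
NON_GPU_W = SYSTEM_FULL_LOAD_W - GPU_COUNT * GPU_TGP_W


def system_draw_w(per_gpu_w: float) -> float:
    """Estimated system load for a given per-GPU power limit."""
    return GPU_COUNT * per_gpu_w + NON_GPU_W


def headroom_fraction(per_gpu_w: float) -> float:
    """Unused PSU capacity as a fraction of the total PSU rating."""
    return (PSU_TOTAL_W - system_draw_w(per_gpu_w)) / PSU_TOTAL_W


if __name__ == "__main__":
    for limit in (GPU_TGP_W, GPU_CAP_W):
        print(f"{limit} W/GPU: {system_draw_w(limit):.0f} W total, "
              f"{headroom_fraction(limit):.0%} headroom")
```

Under these assumptions the non-GPU load works out to 920 W, and the headroom comes to 8% at stock TGP and 18% with the 500 W cap.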

Lane topology

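Each Turin socket has 128 PCIe Gen5-capable lanes, but in a dual-socket system some of them are reassigned to the inter-socket xGMI links, which leaves 128 to 160 host-side lanes depending on the board's xGMI configuration. The eight GPUs use 128 of them (64 per socket). Active Gen5 risers carry Gen5 x16 end-to-end to each GPU, so no PCIe switch is required (one CPU per 4-card bank). There is no NVLink; inter-GPU P2P runs at Gen5 x16 (~60 GB/s nominal per link).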
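
The lane count and per-link bandwidth in this section can be sanity-checked with a short calculation. This is a minimal sketch that assumes the standard PCIe 5.0 parameters (32 GT/s per lane, 128b/130b encoding) and ignores packet and protocol overhead, so it gives a theoretical ceiling rather than a measured figure. Constant and function names are illustrative.

```python
# PCIe Gen5 lane and bandwidth arithmetic for the 8-GPU layout.
GT_PER_S_PER_LANE = 32.0          # PCIe 5.0 raw signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding
LANES_PER_GPU = 16
GPU_COUNT = 8
GPUS_PER_SOCKET = 4               # one CPU per 4-card bank


def link_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes / 8  # bits to bytes


gpu_lanes_total = GPU_COUNT * LANES_PER_GPU
gpu_lanes_per_socket = GPUS_PER_SOCKET * LANES_PER_GPU
x16_bandwidth = link_gb_per_s(LANES_PER_GPU)

if __name__ == "__main__":
    print(f"GPU lanes needed: {gpu_lanes_total} ({gpu_lanes_per_socket} per socket)")
    print(f"Gen5 x16 ceiling: {x16_bandwidth:.1f} GB/s per direction")
```

The x16 ceiling comes out at roughly 63 GB/s per direction, which is in line with the ~60 GB/s nominal figure once protocol overhead is taken into account.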

What you can run

With 256 GB of VRAM pooled across eight Blackwell cards with native fp8, this server targets frontier 235-480B MoE models at Q4 with usable context, the DeepSeek V3 family at Q2, and Kimi-K2 at 1.58-bit dynamic quantization. Reference throughput for each is listed below.

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3-235B-A22B (Instruct / Thinking / "2507") Q4 (~132 GB) with long context + multi-user batching (~25-40 tok/s single-stream on 8x RTX 5090, published reference)
  • GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB) — flagship reasoning/coding, 200k ctx on 4.6+
  • GLM-5 / GLM-5.1 Q2 (~260 GB) with minor RAM spill — frontier coding close to Claude Opus 4.6
  • DeepSeek V3 / R1 / V3.1 / V3.2 / V3.2-Speciale Q2 (~215 GB) at useful inference speed (~28 tok/s single-stream on 8x Blackwell, published reference)
  • Kimi-K2 1.58-bit UD-TQ1_0 (~240 GB) — trillion-parameter agent at real token throughput (~7-10 tok/s single-stream, published reference)
  • Hunyuan-Large 389B/52B MoE Q4 (~220 GB); ERNIE-4.5-424B-A47B Q4 (~240 GB)
  • Qwen3-Coder-480B-A35B Q4 (~270 GB tight with RAM spill) — SOTA open coding flagship
  • MiniMax-M1 / Text-01 Q4 (~260 GB) 1M context; Qwen3.5-397B-A17B Q4 (~214 GB)

Western frontier

  • Mistral Large 3 (675B/41B MoE, Apache 2.0) Q3 (~317 GB with spill) — Western frontier open weights
  • Llama 4 Maverick (400B/17B, 128 experts) Q4 (~232 GB) multimodal
  • Llama-3.1-Nemotron Ultra 253B Q4 (~119 GB) — matches DeepSeek-R1 at half size
  • gpt-oss-120b MXFP4 native (80 GB) comfortably with room for multiple models
  • Devstral 2 123B (Modified MIT) Q6 — top open coding, 256k ctx
  • Llama 3.3 70B bf16 (~142 GB) multi-tenant serving (~30-40 tok/s single-stream per RTX 5090 pair TP2, published reference)
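
The footprints quoted in the two lists above can be compared against the 256 GB pool to see which models fit on-card and which spill to system RAM. This is a rough sketch under one loud assumption: the quoted sizes are treated as weights-only, since the page does not say whether they include KV cache or activations. The helper name `placement` is hypothetical.

```python
# Fit check: quoted model footprints versus the pooled VRAM.
VRAM_POOL_GB = 8 * 32  # eight 32 GB cards

# Footprints in GB as quoted in the lists above.
# Assumption: these are weights only (KV cache and activations not included).
FOOTPRINTS_GB = {
    "Qwen3-235B-A22B Q4": 132,
    "GLM-4.5 Q4": 177,
    "GLM-5 Q2": 260,
    "DeepSeek V3 Q2": 215,
    "Kimi-K2 UD-TQ1_0": 240,
    "Qwen3-Coder-480B-A35B Q4": 270,
    "Mistral Large 3 Q3": 317,
    "Llama 4 Maverick Q4": 232,
    "gpt-oss-120b MXFP4": 80,
}


def placement(size_gb: float, pool_gb: float = VRAM_POOL_GB) -> tuple:
    """Return (GB of VRAM left over, GB spilled to system RAM)."""
    free = max(0, pool_gb - size_gb)
    spill = max(0, size_gb - pool_gb)
    return free, spill


if __name__ == "__main__":
    for name, size in sorted(FOOTPRINTS_GB.items(), key=lambda kv: kv[1]):
        free, spill = placement(size)
        status = f"spills {spill} GB to RAM" if spill else f"{free} GB left in VRAM"
        print(f"{name:28s} {size:4d} GB  {status}")
```

Whatever VRAM is left over is what remains for KV cache and batching, so a model that only just fits will have little room for long context or many concurrent users.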

Vision-Language Models

Qwen3-VL-235B-A22B full bf16 (~240 GB on-card); InternVL3.5-241B-A28B (~135 GB Q4); Llama 3.2 90B Vision bf16; Pixtral Large 124B bf16 (~248 GB tight); Qwen3-Omni-30B-A3B; Molmo 72B; ERNIE-4.5-VL; GLM-4.6V full. Blackwell fp8 path gives ~2x throughput on vision-tower inference vs Ada.

Image generation

FLUX.1 [dev] / Kontext / Tools full bf16 (~10-18 s/image at fp8 per card, published reference); SD 3.5 Large; HunyuanImage-2.1 (17B, native 2K); HunyuanImage-3.0 80B/13B MoE; AuraFlow; OmniGen; multi-worker ComfyUI farms.

Video generation

Wan 2.2 T2V-A14B / I2V-A14B dual expert bf16 (both high-noise + low-noise resident simultaneously); HunyuanVideo 13B bf16 both experts; Open-Sora 2.0 (11B) bf16; CogVideoX-5B; Mochi-1; LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.

Audio / Speech / TTS

  • ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
  • TTS: CosyVoice 2 / 3; Kokoro; Stable Audio Open; XTTS v2; Step-Audio-EditX
  • Realtime / S2S: Kyutai Moshi; Step-Audio 2 mini / R1; Qwen2.5-Omni-7B
  • Music / SFX: MusicGen; AudioGen; Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

  • Frontier-inference gateway — 200B+ MoE + concurrent 70B + image + video all resident
  • 8-way tensor-parallel for Kimi-K2 / DeepSeek V3 at real context
  • Multi-tenant LLM API — 50-100 concurrent users on 235B Q4 via vLLM/SGLang
  • Full Chinese + Western frontier residency concurrently for evaluation / benchmarking

Target workloads

  • Frontier open-weight inference backend for a 100-500 seat org, mixing Qwen3-235B, GLM-4.5+, and DeepSeek V3 Q2
  • Kimi-K2 1.58-bit agent platform at production throughput (tool-use, 200+ sequential calls)
  • Full-fp8 DeepSeek V3 / R1 serving on Blackwell silicon
  • Multi-node training head with Gen5 100 GbE / InfiniBand fabric
  • Dual-role inference + diffusion farm (Qwen3-235B + FLUX.1 + HunyuanVideo 13B concurrently)

Published performance references

External references | Not measured on Kentino hardware

Benchmark | Result
RTX 5090 per-card INT8 TOPS | 1 676 TOPS
RTX 5090 memory bandwidth | ~1 800 GB/s per card
vLLM, Qwen3-235B Q4_K_M on 4x RTX 5090 (single) | ~90 tok/s
vLLM, Qwen3-235B Q4_K_M on 4x RTX 5090 (batch-32) | ~450 tok/s aggregate
SGLang, DeepSeek V3 Q2 on 8x Blackwell (single) | ~28 tok/s
llama.cpp, Kimi-K2 UD-TQ1_0 on 8x Blackwell 256 GB | ~7-10 tok/s
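
The headline aggregate and the batching trade-off implied by this table follow from simple arithmetic on the quoted figures. This is a sketch only: it assumes the batch-32 aggregate is split evenly across the 32 streams, which the published reference does not state. Variable names are illustrative.

```python
# Derived figures from the reference table (no new measurements).
CARDS = 8
INT8_TOPS_PER_CARD = 1676      # per-card figure from the table
VRAM_GB_PER_CARD = 32

SINGLE_STREAM_TOK_S = 90       # vLLM, Qwen3-235B Q4_K_M, 4x RTX 5090
BATCH_SIZE = 32
BATCH_AGGREGATE_TOK_S = 450    # same setup at batch-32

aggregate_tops = CARDS * INT8_TOPS_PER_CARD
aggregate_vram_gb = CARDS * VRAM_GB_PER_CARD

# Assumption: the aggregate is shared evenly between concurrent streams.
per_stream_in_batch = BATCH_AGGREGATE_TOK_S / BATCH_SIZE
throughput_gain = BATCH_AGGREGATE_TOK_S / SINGLE_STREAM_TOK_S

if __name__ == "__main__":
    print(f"Aggregate INT8: {aggregate_tops} TOPS, VRAM: {aggregate_vram_gb} GB")
    print(f"Batch-32: {per_stream_in_batch:.1f} tok/s per stream, "
          f"{throughput_gain:.1f}x total throughput vs single-stream")
```

Read this way, batching gives about five times the total throughput while each individual stream runs at roughly 14 tok/s instead of 90, which is the trade-off to weigh when sizing for concurrent users.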

Kentino will publish first-party tok/s after the first customer build with final Turin SKU.

Not ideal for

  • Budget-conscious deployments (Turin premium vs Genoa or Rome alternatives)
  • Single-tenant 70B dense workloads (overkill — 4x RTX 5090 or 4x RTX Pro 6000 is the right tier)
  • Frontier 600B+ at Q4+ with full context (requires a 576 GB+ pool; see 6x RTX Pro 6000)
  • Sustained training from scratch (no NVLink on consumer RTX 5090)

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability and is confirmed at order.

Recommended add-ons

  • Scale RAM to 1.5 TB DDR5 (24x 64 GB full population) — required for Kimi-K2 Q4 or DeepSeek V3 Q3 without RAM spill
  • NVIDIA ConnectX-5 100 GbE MCX555A-ECAT: 100 Gb/s fabric for cluster nodes (the adapter itself is PCIe 3.0 x16, not Gen5)
  • Mellanox ConnectX-6 25 GbE SFP28 for datacenter fabric
  • 4 TB NVMe Gen4 x4 for boot + model library
  • Full 24U rack cabinet with managed PDU
  • Online UPS 8-10 kVA (critical — 5.5 kW peak draw)
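
The UPS recommendation can be checked against the quoted 5.5 kW peak draw. This is a minimal sketch that assumes an output power factor of 0.9 for the UPS. That value is an assumption, not a figure from this page, so substitute the rating from the datasheet of the unit being considered. The function name is hypothetical.

```python
# UPS sizing sketch: how heavily loaded is an 8 or 10 kVA unit at peak draw?
PEAK_DRAW_KW = 5.5            # peak draw quoted in the add-on list
ASSUMED_POWER_FACTOR = 0.9    # assumption; check the UPS datasheet


def ups_load_fraction(ups_kva: float, power_factor: float = ASSUMED_POWER_FACTOR) -> float:
    """Peak server load as a fraction of the UPS real-power capacity."""
    real_power_capacity_kw = ups_kva * power_factor
    return PEAK_DRAW_KW / real_power_capacity_kw


if __name__ == "__main__":
    for kva in (8, 10):
        print(f"{kva} kVA UPS: {ups_load_fraction(kva):.0%} loaded at peak")
```

With that power factor, an 8 kVA unit would run at about 76% load at peak and a 10 kVA unit at about 61%, before accounting for any other equipment in the rack.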