Kentino s.r.o.
K-AI 64 Rome 5090 3352TOPS — 2x RTX 5090 Entry Blackwell AI Server
K-AI 64 Rome 5090 3352TOPS — 2x RTX 5090 Entry Blackwell AI Server
Couldn't load pickup availability
K-AI 64 Rome 5090 3352TOPS
Entry Blackwell 2-GPU Server
2x RTX 5090 | EPYC Milan | 3 352 TOPS INT8
Entry Blackwell 2-GPU server — 64 GB pooled VRAM, 3 352 INT8 TOPS, native fp8. The Ada-to-Blackwell step-up from 2x4090.
A two-GPU Blackwell AI server built on ROMED8-2T / EPYC Milan. Two RTX 5090 deliver a 64 GB pooled VRAM envelope with native fp8 tensor math — roughly double the raw TOPS of 2x RTX 4090 in the same chassis footprint, and the first 2-GPU tier that comfortably runs Llama 3.3 70B Q4, Qwen3.5-122B-A10B Q4, and HunyuanVideo at bf16 / fp8 with headroom.
Hardware
| Component | Detail |
|---|---|
| GPUs | 2x NVIDIA GeForce RTX 5090 32 GB GDDR7 (575 W, PCIe 5.0 x16, Blackwell) |
| VRAM pool | 64 GB |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 128 GB DDR4-2666 ECC RDIMM (2x 64 GB) |
| Boot / storage | 1 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | Single 2 kW ATX PSU |
| Chassis | 4U rack-mount, passive Gen4 x16 risers |
| Cooling | SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust (industrial fans) |
| Network | Onboard dual 10 GbE (Intel X550) + IPMI |
Power envelope
- GPU draw: 2 x 575 W = 1 150 W
- System total at full load: ~1 475 W
- PSU total: 2 000 W (single 2 kW ATX) — 26.25 % headroom
- Workable single-PSU margin; dual-PSU upgrade available for extra headroom
Lane topology
ROMED8-2T fans out 2x16 Gen4 from CPU root complex. 5090 is Gen5 silicon running Gen4 x16 without bandwidth penalty for inference. No PCIe switch. No NVLink on GeForce 5090 — tensor-parallel 2-way P2P uses PCIe.
What you can run
With 64 GB of pooled GDDR7 VRAM across 2 Blackwell cards, this server handles 70B Q4 tensor-parallel, MoE flagships, native fp8 image generation, video AI, and multi-model concurrent serving.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3-32B Q8 / bf16 (near-fp16 quality) (~40-55 tok/s single-stream on Blackwell fp8, published reference)
- QwQ-32B bf16; Qwen3-30B-A3B / Coder-30B-A3B bf16 (~60 GB fits)
- Qwen3.5-122B-A10B Q4 (~70-75 GB with RAM spill) — MoE flagship at Q4 fits
- Hunyuan-A13B fp8 (~80 GB tight) or Q6 (~36 GB comfortable)
- Seed-OSS-36B bf16 (~72 GB tight — prefer fp8 ~36 GB)
- DeepSeek-R2 32B sparse MoE bf16
- GLM-4.5-Air 106B/12B Q4_K_M (~60 GB) — MoE with headroom
- ERNIE-4.5-47B-A3B Q6-Q8
Western frontier
- Llama 3.3 70B Q4_K_M (~43 GB) — the headline workload for this tier (~20-28 tok/s single-stream on 2x 5090, published reference)
- Hermes 3 70B / Tulu 3 70B Q4 — open post-training Llama derivatives
- Mistral Small 3 / Magistral / Devstral Small 2 24B bf16; Mixtral 8x7B bf16
- Gemma 3 27B multimodal bf16 + reasoning headroom
- Phi-4 14B bf16; Nemotron-Super 49B Q6-Q8
- gpt-oss-20b MXFP4 (16 GB) + gpt-oss-120b MXFP4 (80 GB — fits tight with short ctx)
- OLMo 2 32B / OLMo 3.1-32B-Think bf16
Vision-Language
Qwen3-VL-32B / Qwen3-VL-30B-A3B / Qwen3-Omni-30B-A3B bf16; InternVL3.5-38B bf16; Llama 3.2 90B Vision Q4 (~52 GB); Pixtral 12B bf16; Pixtral Large 124B Q3 (~58 GB tight); Gemma 3 27B multimodal bf16; PaliGemma 2 28B bf16; Molmo 72B Q4 (~45 GB).
Image generation
5090 native fp8 is the speed story — FLUX.1 / SD 3.5 / HunyuanImage run materially faster than on Ada: FLUX.1 [dev] / [schnell] fp8 native (~12 GB) with 2x parallel across cards (~8-12 seconds per 1024x1024 image on Blackwell, published reference); FLUX.1 Kontext [dev]; SD 3.5 Large (18 GB fp16 or 11 GB fp8); SDXL 1.0; HunyuanImage-2.1 bf16 (~34 GB); HunyuanImage-3.0 NF4; AuraFlow v0.3 / OmniGen v1 / Kolors 2.0.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B bf16 (~54 GB total) — MoE two-expert at full precision; Wan 2.2 TI2V-5B bf16 per-card, 2 parallel tenants; HunyuanVideo 13B Q4-Q5 (~30 GB), fp8 tight; HunyuanVideo 1.5 (8.3B) bf16 per-card; Open-Sora 2.0 (11B) bf16; CogVideoX-5B / 1.5 bf16; Mochi-1 bf16 (~42 GB fits); LTX-Video 2B; NVIDIA Cosmos Predict 2.
Audio / Speech / TTS
Same full Chinese + Western speech stack as the 4090 tier fits with more headroom: Whisper v3 + Parakeet + Canary + Moshi + Step-Audio 2 / R1 + CosyVoice 3.0 + Kokoro + Stable Audio Open + MusicGen + AudioGen + SeamlessM4T v2 + MMS. On fp8-native 5090, Whisper / Parakeet decode at materially higher real-time factor. Whisper v3 turbo runs at ~75x realtime on Blackwell (published reference).
Multi-model / multi-tenant
- Resident stack: Llama 3.3 70B Q4 (~43 GB tensor-parallel 2-way) + FLUX.1 fp8 (~12 GB) + Whisper-turbo + Moshi
- 2-4 concurrent tenants on 32B class at Q6-Q8 per card
- LoRA / QLoRA fine-tuning of 7-14B comfortable, 24-32B tight
Target workloads
- Small-team developer workstation with 70B Q4 serving headroom
- Blackwell step-up from a 2x RTX 4090 box — same chassis, ~2.5x TOPS, fp8 native
- Image / video generation workstation with FLUX native fp8 speedup
- Multi-model concurrent box: 70B Q4 + FLUX + Whisper + Moshi resident simultaneously
- 4-8 concurrent user inference endpoint for 32B class LLMs
Published performance references
Published reference | 2x RTX 5090 comparable hardware
| Benchmark | Result |
|---|---|
| Llama 3.3 70B Q4_K_M llama.cpp decode | ~20-28 tok/s single-stream |
| Qwen3-32B Q8 vLLM single-stream | ~45-60 tok/s decode at fp8 |
| FLUX.1 [dev] fp8 native Blackwell | ~1.5-1.9 s per 1024x1024 at 20 steps |
| HunyuanVideo 13B Q5 TP-2 | 5 s 720p in ~5-7 min |
Published, not measured on Kentino hardware. Kentino measured reference on 4x RTX 4090: 647 TFLOPS fp16, 179 tok/s batch-32 aggregate.
Not ideal for
- 100B+ dense models at bf16 (DeepSeek-V3, Kimi K2, Mistral Large 3 — need 256+ GB pool)
- Frontier video generation at bf16 long-form full-resolution
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- NVIDIA ConnectX-5 100 GbE MCX555A-ECAT
- Upgrade boot drive to 2 TB NVMe — or 4 TB
- Upgrade RAM to 256 GB (4x 64 GB) — MoE KV cache headroom / multi-model concurrent serving
- Rack PDU (C13/C19 metered) and 3 kVA online UPS
Share
