Kentino s.r.o.

K-AI 64 Rome 5090 3352TOPS — 2x RTX 5090 Entry Blackwell AI Server

Regular price €11.653,00 EUR
Taxes included. Shipping calculated at checkout.

K-AI 64 Rome 5090 3352TOPS

Entry Blackwell 2-GPU Server
2x RTX 5090 | EPYC Milan | 3 352 TOPS INT8

3 352 TOPS INT8 | 64 GB VRAM GDDR7 | fp8 native tensor | rack ready

Entry Blackwell 2-GPU server — 64 GB pooled VRAM, 3 352 INT8 TOPS, native fp8. The Ada-to-Blackwell step-up from 2x4090.

A two-GPU Blackwell AI server built on the ASRock Rack ROMED8-2T with an AMD EPYC Milan CPU. Two RTX 5090 cards deliver a 64 GB pooled VRAM envelope with native fp8 tensor math and roughly double the raw TOPS of 2x RTX 4090 in the same chassis footprint. It is the first 2-GPU tier that comfortably runs Llama 3.3 70B Q4 (~43 GB). Qwen3.5-122B-A10B Q4 (~70-75 GB) runs with RAM spill, and HunyuanVideo runs at Q4-Q5 on the 13B model (fp8 is tight) or at bf16 per card on the 1.5 (8.3B) model.

Hardware

Component | Detail
GPUs | 2x NVIDIA GeForce RTX 5090 32 GB GDDR7 (575 W, PCIe 5.0 x16, Blackwell)
VRAM pool | 64 GB
CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM | 128 GB DDR4-2666 ECC RDIMM (2x 64 GB)
Boot / storage | 1 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | Single 2 kW ATX PSU
Chassis | 4U rack-mount, passive Gen4 x16 risers
Cooling | SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust (industrial fans)
Network | Onboard dual 10 GbE (Intel X550) + IPMI

Power envelope

  • GPU draw: 2 x 575 W = 1 150 W
  • System total at full load: ~1 475 W
  • PSU total: 2 000 W (single 2 kW ATX) — 26.25 % headroom
  • Workable single-PSU margin; dual-PSU upgrade available for extra headroom
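
The power envelope above can be re-derived in a few lines. This is a minimal sketch using only the wattages quoted in this listing; the variable names are illustrative, and the non-GPU share is simply the stated system total minus the GPU draw.

```python
# Figures quoted in the listing; names are illustrative.
GPU_COUNT = 2
GPU_TDP_W = 575
SYSTEM_FULL_LOAD_W = 1475  # "~1 475 W" system total at full load
PSU_CAPACITY_W = 2000      # single 2 kW ATX PSU

gpu_draw_w = GPU_COUNT * GPU_TDP_W
# Everything that is not a GPU: CPU, RAM, fans, NVMe, NICs, conversion losses.
rest_of_system_w = SYSTEM_FULL_LOAD_W - gpu_draw_w
headroom_w = PSU_CAPACITY_W - SYSTEM_FULL_LOAD_W
# Headroom is expressed as a share of PSU capacity, matching the listing.
headroom_pct = 100 * headroom_w / PSU_CAPACITY_W

print(f"GPU draw:       {gpu_draw_w} W")
print(f"Rest of system: {rest_of_system_w} W")
print(f"PSU headroom:   {headroom_w} W ({headroom_pct:.2f} % of capacity)")
```

Note that the 26.25 % figure is measured against PSU capacity, not against the load; relative to the 1 475 W load the same 525 W is a larger percentage.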

Lane topology

The ROMED8-2T routes two x16 Gen4 links directly from the CPU root complex, with no PCIe switch in between. The 5090 is Gen5 silicon and runs at Gen4 x16 here, without a bandwidth penalty that matters for inference. GeForce RTX 5090 has no NVLink, so 2-way tensor-parallel P2P traffic goes over PCIe.
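
To put numbers on the Gen4 versus Gen5 link, the sketch below computes the theoretical one-direction bandwidth of an x16 slot. It assumes the standard PCIe signalling rates (16 GT/s for Gen4, 32 GT/s for Gen5) and 128b/130b line coding, and it ignores packet and protocol overhead, so real transfers will land lower. The function name is hypothetical.

```python
def pcie_gb_per_s(gigatransfers_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s, 128b/130b coding only."""
    encoding_efficiency = 128 / 130
    bits_per_s = gigatransfers_per_s * 1e9 * encoding_efficiency * lanes
    return bits_per_s / 8 / 1e9

gen4_x16 = pcie_gb_per_s(16.0)  # what each 5090 gets on this board
gen5_x16 = pcie_gb_per_s(32.0)  # what the card could negotiate on a Gen5 host

# Lane budget: two x16 GPU links out of the CPU's 128 Gen4 lanes.
lanes_used_by_gpus = 2 * 16
cpu_lanes = 128

print(f"Gen4 x16: {gen4_x16:.1f} GB/s per direction")
print(f"Gen5 x16: {gen5_x16:.1f} GB/s per direction")
print(f"GPU lanes: {lanes_used_by_gpus} of {cpu_lanes}")
```

The two GPUs consume only a quarter of the CPU's lanes, which is why no PCIe switch is needed and the remaining slots stay free for NVMe or a network card.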

What you can run

With 64 GB of pooled GDDR7 VRAM across 2 Blackwell cards, this server handles 70B Q4 tensor-parallel, MoE flagships, native fp8 image generation, video AI, and multi-model concurrent serving.

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3-32B Q8 / bf16 (near-fp16 quality) (~40-55 tok/s single-stream on Blackwell fp8, published reference)
  • QwQ-32B bf16; Qwen3-30B-A3B / Coder-30B-A3B bf16 (~60 GB fits)
  • Qwen3.5-122B-A10B Q4 (~70-75 GB with RAM spill) — MoE flagship at Q4 fits
  • Hunyuan-A13B fp8 (~80 GB, exceeds the 64 GB pool and needs RAM spill) or Q6 (~36 GB comfortable)
  • Seed-OSS-36B bf16 (~72 GB, exceeds the 64 GB pool; prefer fp8 at ~36 GB)
  • DeepSeek-R2 32B sparse MoE bf16
  • GLM-4.5-Air 106B/12B Q4_K_M (~60 GB) — MoE with headroom
  • ERNIE-4.5-47B-A3B Q6-Q8
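
The footprints quoted in this list follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. The sketch below is an approximation under that assumption only. It covers weights, not KV cache or runtime overhead, and the helper name is hypothetical.

```python
POOL_GB = 64  # 2x 32 GB, from the listing

# Bytes per parameter for unquantized formats.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}

def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (weights only)."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in ("bf16", "fp8"):
    gb = weight_gb(36, precision)  # Seed-OSS-36B
    verdict = "fits" if gb <= POOL_GB else "exceeds pool"
    print(f"Seed-OSS-36B {precision}: ~{gb:.0f} GB ({verdict})")
```

This reproduces the ~72 GB (bf16) and ~36 GB (fp8) figures given for Seed-OSS-36B, and shows why fp8 is the practical choice on a 64 GB pool.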

Western frontier

  • Llama 3.3 70B Q4_K_M (~43 GB) — the headline workload for this tier (~20-28 tok/s single-stream on 2x 5090, published reference)
  • Hermes 3 70B / Tulu 3 70B Q4 — open post-training Llama derivatives
  • Mistral Small 3 / Magistral / Devstral Small 2 24B bf16; Mixtral 8x7B bf16
  • Gemma 3 27B multimodal bf16 + reasoning headroom
  • Phi-4 14B bf16; Nemotron-Super 49B Q6-Q8
  • gpt-oss-20b MXFP4 (16 GB) + gpt-oss-120b MXFP4 (80 GB, over the 64 GB pool; short ctx with RAM spill)
  • OLMo 2 32B / OLMo 3.1-32B-Think bf16
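
As a quick sanity check, the footprints stated in the two LLM lists can be compared against the 64 GB pool. This is a sketch that uses only numbers quoted in this listing (taking the lower bound where a range is given); it says nothing about KV cache or context length.

```python
POOL_GB = 64

# Footprints in GB as stated in the listing.
stated_footprints_gb = {
    "Llama 3.3 70B Q4_K_M": 43,
    "GLM-4.5-Air Q4_K_M": 60,
    "Qwen3.5-122B-A10B Q4": 70,  # lower bound of the stated 70-75 GB
    "Seed-OSS-36B bf16": 72,
    "gpt-oss-120b MXFP4": 80,
}

over_pool = []
for name, gb in stated_footprints_gb.items():
    spare_gb = POOL_GB - gb
    if spare_gb >= 0:
        print(f"{name}: fits in VRAM, {spare_gb} GB spare")
    else:
        over_pool.append(name)
        print(f"{name}: {-spare_gb} GB over the pool, needs RAM spill")
```

Anything that lands in `over_pool` depends on system RAM offload, so the 128 GB of DDR4 (or the 256 GB upgrade) becomes part of the working set for those models.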

Vision-Language

  • Qwen3-VL-32B / Qwen3-VL-30B-A3B / Qwen3-Omni-30B-A3B bf16
  • InternVL3.5-38B bf16
  • Llama 3.2 90B Vision Q4 (~52 GB)
  • Pixtral 12B bf16; Pixtral Large 124B Q3 (~58 GB tight)
  • Gemma 3 27B multimodal bf16
  • PaliGemma 2 28B bf16
  • Molmo 72B Q4 (~45 GB)

Image generation

Native fp8 on the 5090 is the speed story: FLUX.1, SD 3.5, and HunyuanImage run materially faster than on Ada.

  • FLUX.1 [dev] / [schnell] fp8 native (~12 GB) with 2x parallel across cards (~8-12 seconds per 1024x1024 image on Blackwell, published reference)
  • FLUX.1 Kontext [dev]
  • SD 3.5 Large (18 GB fp16 or 11 GB fp8)
  • SDXL 1.0
  • HunyuanImage-2.1 bf16 (~34 GB)
  • HunyuanImage-3.0 NF4
  • AuraFlow v0.3 / OmniGen v1 / Kolors 2.0

Video generation

  • Wan 2.2 T2V-A14B / I2V-A14B bf16 (~54 GB total; MoE two-expert at full precision)
  • Wan 2.2 TI2V-5B bf16 per-card, 2 parallel tenants
  • HunyuanVideo 13B Q4-Q5 (~30 GB), fp8 tight
  • HunyuanVideo 1.5 (8.3B) bf16 per-card
  • Open-Sora 2.0 (11B) bf16
  • CogVideoX-5B / 1.5 bf16
  • Mochi-1 bf16 (~42 GB fits)
  • LTX-Video 2B
  • NVIDIA Cosmos Predict 2

Audio / Speech / TTS

Same full Chinese + Western speech stack as the 4090 tier fits with more headroom: Whisper v3 + Parakeet + Canary + Moshi + Step-Audio 2 / R1 + CosyVoice 3.0 + Kokoro + Stable Audio Open + MusicGen + AudioGen + SeamlessM4T v2 + MMS. On fp8-native 5090, Whisper / Parakeet decode at materially higher real-time factor. Whisper v3 turbo runs at ~75x realtime on Blackwell (published reference).

Multi-model / multi-tenant

  • Resident stack: Llama 3.3 70B Q4 (~43 GB tensor-parallel 2-way) + FLUX.1 fp8 (~12 GB) + Whisper-turbo + Moshi
  • 2-4 concurrent tenants on 32B class at Q6-Q8 per card
  • LoRA / QLoRA fine-tuning of 7-14B comfortable, 24-32B tight
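
The resident stack above is a VRAM budgeting exercise. The sketch below adds up the two footprints the listing states and reports what remains for everything else. Whisper-turbo and Moshi sizes are not given in the listing, so they are left as part of the remainder rather than guessed.

```python
POOL_GB = 64

# Footprints stated in the listing for the resident stack.
resident_gb = {
    "Llama 3.3 70B Q4 (tensor-parallel 2-way)": 43,
    "FLUX.1 fp8": 12,
}

committed_gb = sum(resident_gb.values())
# Must cover Whisper-turbo, Moshi, KV cache, and activations.
remaining_gb = POOL_GB - committed_gb

print(f"Committed: {committed_gb} GB")
print(f"Remaining: {remaining_gb} GB for speech models, KV cache, activations")
```

The pool is split across two 32 GB cards, so a real deployment also has to place each model on a specific card; this sketch only checks the total.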

Target workloads

  • Small-team developer workstation with 70B Q4 serving headroom
  • Blackwell step-up from a 2x RTX 4090 box: same chassis, roughly double the raw TOPS, fp8 native
  • Image / video generation workstation with FLUX native fp8 speedup
  • Multi-model concurrent box: 70B Q4 + FLUX + Whisper + Moshi resident simultaneously
  • 4-8 concurrent user inference endpoint for 32B class LLMs
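
How much "serving headroom" a 70B Q4 model leaves depends mostly on the KV cache. The sketch below estimates an upper bound on cached tokens when serving Llama 3.3 70B Q4 alone. The layer count, KV head count, and head dimension are assumptions taken from the public Llama 3 70B configuration, not from this listing, and the result ignores activations and runtime overhead.

```python
# Assumed architecture (public Llama 3 70B config, not from the listing).
NUM_LAYERS = 80
NUM_KV_HEADS = 8     # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 / bf16 KV cache

POOL_GB = 64
WEIGHTS_GB = 43      # Llama 3.3 70B Q4_K_M, from the listing

# Keys and values are both cached per layer, hence the factor of 2.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
free_bytes = (POOL_GB - WEIGHTS_GB) * 1e9
max_cached_tokens = int(free_bytes // kv_bytes_per_token)

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Upper bound on cached tokens: {max_cached_tokens}")
```

That token budget is shared by all concurrent users, so eight users would each get roughly an eighth of it under these assumptions.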

Published performance references

Published reference | 2x RTX 5090 comparable hardware

Benchmark | Result
Llama 3.3 70B Q4_K_M, llama.cpp decode | ~20-28 tok/s single-stream
Qwen3-32B Q8, vLLM single-stream | ~45-60 tok/s decode at fp8
FLUX.1 [dev] fp8 native, Blackwell | ~1.5-1.9 s per 1024x1024 at 20 steps
HunyuanVideo 13B Q5 TP-2, 5 s 720p clip | ~5-7 min

These figures are published references, not measured on Kentino hardware. Kentino's own measured reference is on 4x RTX 4090: 647 TFLOPS fp16, 179 tok/s batch-32 aggregate.
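
The reference figures are easier to judge when converted into user-facing terms. The sketch below does three conversions from the quoted numbers; the 500-token response length is a hypothetical example, not a figure from this listing.

```python
def seconds_for_tokens(n_tokens: int, tok_per_s: float) -> float:
    """Decode time for a response, ignoring prompt processing."""
    return n_tokens / tok_per_s

RESPONSE_TOKENS = 500  # hypothetical response length

# Llama 3.3 70B Q4_K_M: ~20-28 tok/s single-stream (published reference).
llama_slowest_s = seconds_for_tokens(RESPONSE_TOKENS, 20)
llama_fastest_s = seconds_for_tokens(RESPONSE_TOKENS, 28)

# FLUX.1 [dev]: ~1.5-1.9 s per image at 20 steps -> time per denoising step.
flux_ms_per_step = (1.5 * 1000 / 20, 1.9 * 1000 / 20)

# Kentino 4x RTX 4090 measurement: 179 tok/s aggregate at batch 32.
per_stream_tok_s = 179 / 32

print(f"500-token reply: {llama_fastest_s:.1f}-{llama_slowest_s:.1f} s")
print(f"FLUX per step: {flux_ms_per_step[0]:.0f}-{flux_ms_per_step[1]:.0f} ms")
print(f"Batch-32 average per stream: {per_stream_tok_s:.1f} tok/s")
```

The last line illustrates the usual trade-off: aggregate throughput at batch 32 is high, but each individual stream sees only a fraction of it.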

Not ideal for

  • 100B+ parameter models at bf16 (DeepSeek-V3, Kimi K2, Mistral Large 3; these need a 256+ GB pool)
  • Frontier video generation at bf16 long-form full-resolution

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.

Recommended add-ons

  • NVIDIA ConnectX-5 100 GbE MCX555A-ECAT
  • Upgrade boot drive to 2 TB NVMe — or 4 TB
  • Upgrade RAM to 256 GB (4x 64 GB) — MoE KV cache headroom / multi-model concurrent serving
  • Rack PDU (C13/C19 metered) and 3 kVA online UPS