Kentino s.r.o.

K-AI 64 Rome 5090 3352TOPS — 2x RTX 5090 Entry Blackwell AI Server

Name: K-AI 64 Rome 5090 3352TOPS — 2x RTX 5090 Entry Blackwell AI Server
Brand: Kentino s.r.o.
Price: 18250.00 EUR
Availability: InStock

Entry Blackwell 2-GPU Server
2x RTX 5090 | EPYC Milan | 3 352 TOPS INT8

3 352 TOPS INT8
64 GB VRAM GDDR7
fp8 native tensor
rack ready

Entry Blackwell 2-GPU server — 64 GB pooled VRAM, 3 352 INT8 TOPS, native fp8. The Ada-to-Blackwell step-up from 2x4090.

A two-GPU Blackwell AI server built on the ASRock Rack ROMED8-2T with an EPYC Milan CPU. Two RTX 5090 cards provide 64 GB of pooled VRAM with native fp8 tensor math, roughly double the raw TOPS of 2x RTX 4090 in the same chassis footprint. It is the first 2-GPU tier that runs Llama 3.3 70B Q4 and HunyuanVideo at bf16 / fp8 with headroom; Qwen3.5-122B-A10B Q4 also runs, with spill to system RAM.

Hardware

Component	Detail
GPUs	2x NVIDIA GeForce RTX 5090 32 GB GDDR7 (575 W, PCIe 5.0 x16, Blackwell)
VRAM pool	64 GB
CPU	AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard	ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM	128 GB DDR4-2666 ECC RDIMM (2x 64 GB)
Boot / storage	1 TB NVMe M.2 (PCIe 4.0 x4)
Power supply	Single 2 kW ATX PSU
Chassis	4U rack-mount, passive Gen4 x16 risers
Cooling	SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust (industrial fans)
Network	Onboard dual 10 GbE (Intel X550) + IPMI

Power envelope

GPU draw: 2 x 575 W = 1 150 W
System total at full load: ~1 475 W
PSU total: 2 000 W (single 2 kW ATX) — 26.25 % headroom
Workable single-PSU margin; dual-PSU upgrade available for extra headroom
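
The headroom figure above follows from simple arithmetic. A minimal sketch that reproduces it from the quoted wattages; the function name `psu_headroom` is illustrative, and the non-GPU draw is derived here as the difference between the quoted system total and the GPU draw:

```python
def psu_headroom(gpu_watts, gpu_count, system_total_watts, psu_watts):
    """Return (GPU draw, non-GPU draw, headroom as a fraction of PSU capacity)."""
    gpu_draw = gpu_watts * gpu_count
    # Everything that is not a GPU: CPU, RAM, fans, storage, NICs.
    other_draw = system_total_watts - gpu_draw
    headroom = (psu_watts - system_total_watts) / psu_watts
    return gpu_draw, other_draw, headroom


# Figures quoted in the power envelope: 2 x 575 W GPUs, ~1 475 W total, 2 kW PSU.
gpu_draw, other_draw, headroom = psu_headroom(575, 2, 1475, 2000)
print(f"GPU draw: {gpu_draw} W")
print(f"Non-GPU draw: {other_draw} W")
print(f"PSU headroom: {headroom:.2%}")
```

The same function can be rerun with a different PSU rating to see how the margin changes with the dual-PSU upgrade.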

Lane topology

The ROMED8-2T wires two PCIe 4.0 x16 links directly to the CPU root complex, with no PCIe switch in between. The RTX 5090 is Gen5 silicon and runs here at Gen4 x16, which does not bottleneck inference. The GeForce RTX 5090 has no NVLink, so 2-way tensor-parallel P2P traffic goes over PCIe.
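
For a sense of scale, the theoretical link rates behind the Gen4 versus Gen5 point can be sketched as follows. This assumes the standard per-lane transfer rates (16 GT/s for Gen4, 32 GT/s for Gen5) and 128b/130b encoding; real P2P throughput is lower once protocol overhead is included, and the helper name `link_gb_per_s` is illustrative:

```python
# Per-lane transfer rates in GT/s. PCIe Gen3 and later use 128b/130b encoding.
GT_PER_S = {"gen4": 16.0, "gen5": 32.0}
ENCODING_EFFICIENCY = 128 / 130


def link_gb_per_s(gen, lanes=16):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes / 8


gen4 = link_gb_per_s("gen4")
gen5 = link_gb_per_s("gen5")
print(f"Gen4 x16: {gen4:.1f} GB/s per direction")
print(f"Gen5 x16: {gen5:.1f} GB/s per direction")
```

Each card therefore sees roughly half the link bandwidth it could use in a Gen5 slot, which matters mainly for traffic that crosses the bus, such as tensor-parallel exchanges between the two GPUs.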

What you can run

With 64 GB of pooled GDDR7 VRAM across 2 Blackwell cards, this server handles 70B Q4 tensor-parallel, MoE flagships, native fp8 image generation, video AI, and multi-model concurrent serving.

LLMs — text / reasoning / coding

Chinese frontier

Qwen3-32B Q8 / bf16 (near-fp16 quality) (~40-55 tok/s single-stream on Blackwell fp8, published reference)
QwQ-32B bf16; Qwen3-30B-A3B / Coder-30B-A3B bf16 (~60 GB fits)
Qwen3.5-122B-A10B Q4 (~70-75 GB, exceeds the 64 GB pool): MoE flagship, runs at Q4 with spill to system RAM
Hunyuan-A13B fp8 (~80 GB, needs RAM spill) or Q6 (~36 GB comfortable)
Seed-OSS-36B bf16 (~72 GB, exceeds the pool; prefer fp8 at ~36 GB)
DeepSeek-R2 32B sparse MoE bf16
GLM-4.5-Air 106B/12B Q4_K_M (~60 GB) — MoE with headroom
ERNIE-4.5-47B-A3B Q6-Q8

Western frontier

Llama 3.3 70B Q4_K_M (~43 GB) — the headline workload for this tier (~20-28 tok/s single-stream on 2x 5090, published reference)
Hermes 3 70B / Tulu 3 70B Q4 — open post-training Llama derivatives
Mistral Small 3 / Magistral / Devstral Small 2 24B bf16; Mixtral 8x7B bf16
Gemma 3 27B multimodal bf16 + reasoning headroom
Phi-4 14B bf16; Nemotron-Super 49B Q6-Q8
gpt-oss-20b MXFP4 (16 GB class) + gpt-oss-120b MXFP4 (80 GB class; tight on the 64 GB pool, short context only)
OLMo 2 32B / OLMo 3.1-32B-Think bf16
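
The size figures in these lists can be approximated from parameter count and precision. A rough sketch, assuming weights dominate the footprint and ignoring KV cache and activations; the 70.6B parameter count and the ~4.85 bits per weight average for Q4_K_M are assumptions not stated on this page, and `weight_gb` is an illustrative helper:

```python
POOL_GB = 64  # 2 x 32 GB, from the hardware table


def weight_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (10^9 bytes), excluding KV cache."""
    return params_billion * bits_per_weight / 8


# (label, parameters in billions, assumed average bits per weight)
examples = [
    ("Llama 3.3 70B at ~4.85 bpw (Q4_K_M, assumed)", 70.6, 4.85),
    ("36B model at bf16", 36.0, 16.0),
    ("36B model at fp8", 36.0, 8.0),
]

for label, params, bits in examples:
    gb = weight_gb(params, bits)
    verdict = "fits in pool" if gb < POOL_GB else "exceeds pool"
    print(f"{label}: ~{gb:.0f} GB ({verdict})")
```

Context length adds KV cache on top of these numbers, so a model whose weights land close to 64 GB leaves little room for long prompts.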

Vision-Language

Qwen3-VL-32B / Qwen3-VL-30B-A3B / Qwen3-Omni-30B-A3B bf16
InternVL3.5-38B bf16
Llama 3.2 90B Vision Q4 (~52 GB)
Pixtral 12B bf16
Pixtral Large 124B Q3 (~58 GB tight)
Gemma 3 27B multimodal bf16
PaliGemma 2 28B bf16
Molmo 72B Q4 (~45 GB)

Image generation

Native fp8 on the 5090 is the speed story: FLUX.1, SD 3.5, and HunyuanImage run materially faster than on Ada.

FLUX.1 [dev] / [schnell] fp8 native (~12 GB) with 2x parallel across cards (~8-12 seconds per 1024x1024 image on Blackwell, published reference)
FLUX.1 Kontext [dev]
SD 3.5 Large (18 GB fp16 or 11 GB fp8)
SDXL 1.0
HunyuanImage-2.1 bf16 (~34 GB)
HunyuanImage-3.0 NF4
AuraFlow v0.3 / OmniGen v1 / Kolors 2.0

Video generation

Wan 2.2 T2V-A14B / I2V-A14B bf16 (~54 GB total): MoE two-expert at full precision
Wan 2.2 TI2V-5B bf16 per card, 2 parallel tenants
HunyuanVideo 13B Q4-Q5 (~30 GB), fp8 tight
HunyuanVideo 1.5 (8.3B) bf16 per card
Open-Sora 2.0 (11B) bf16
CogVideoX-5B / 1.5 bf16
Mochi-1 bf16 (~42 GB fits)
LTX-Video 2B
NVIDIA Cosmos Predict 2

Audio / Speech / TTS

The same full Chinese + Western speech stack as on the 4090 tier fits here with more headroom: Whisper v3, Parakeet, Canary, Moshi, Step-Audio 2 / R1, CosyVoice 3.0, Kokoro, Stable Audio Open, MusicGen, AudioGen, SeamlessM4T v2, and MMS. On the fp8-native 5090, Whisper and Parakeet decode at a materially higher real-time factor. Whisper v3 turbo runs at ~75x realtime on Blackwell (published reference).

Multi-model / multi-tenant

Resident stack: Llama 3.3 70B Q4 (~43 GB tensor-parallel 2-way) + FLUX.1 fp8 (~12 GB) + Whisper-turbo + Moshi
2-4 concurrent tenants on 32B class at Q6-Q8 per card
LoRA / QLoRA fine-tuning of 7-14B comfortable, 24-32B tight
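
The resident stack above can be checked against the VRAM budget with the two footprints this page quotes. A minimal sketch; it assumes that 2-way tensor parallelism splits the Llama weights evenly across the cards, and it does not guess the footprints of Whisper-turbo or Moshi, which are not quoted here:

```python
POOL_GB = 64   # pooled VRAM from the hardware table
CARD_GB = 32   # VRAM per RTX 5090
CARDS = 2

llama_gb = 43  # Llama 3.3 70B Q4, as quoted in the resident stack
flux_gb = 12   # FLUX.1 fp8, as quoted in the resident stack

# Pool-level view: what remains after the two quoted footprints.
pool_left = POOL_GB - llama_gb - flux_gb

# Per-card view. Assumption: the Llama weights are split evenly across cards.
per_card_left = CARD_GB - llama_gb / CARDS

print(f"Pool left after Llama + FLUX.1: {pool_left} GB")
print(f"Per card left after an even Llama split: {per_card_left} GB")
```

The pooled remainder has to cover Whisper-turbo, Moshi, KV cache, and activations. Because the pool is two separate 32 GB cards, each component also has to fit per card: under the even-split assumption, a ~12 GB FLUX.1 copy would need an uneven weight split or a different placement.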

Target workloads

Small-team developer workstation with 70B Q4 serving headroom
Blackwell step-up from a 2x RTX 4090 box — same chassis, ~2.5x TOPS, fp8 native
Image / video generation workstation with FLUX native fp8 speedup
Multi-model concurrent box: 70B Q4 + FLUX + Whisper + Moshi resident simultaneously
4-8 concurrent user inference endpoint for 32B class LLMs

Published performance references

Published references for comparable 2x RTX 5090 hardware

Benchmark	Result
Llama 3.3 70B Q4_K_M llama.cpp decode	~20-28 tok/s single-stream
Qwen3-32B Q8 vLLM single-stream	~45-60 tok/s decode at fp8
FLUX.1 [dev] fp8 native Blackwell	~1.5-1.9 s per 1024x1024 at 20 steps
HunyuanVideo 13B Q5 TP-2	5 s 720p in ~5-7 min

These figures are published references, not measurements on Kentino hardware. Kentino's own measured reference, on 4x RTX 4090: 647 TFLOPS fp16, 179 tok/s batch-32 aggregate.
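
To relate the decode rates in the table to perceived latency, a small sketch that converts tokens per second into time per response. The 500-token response length is a hypothetical example, prompt processing time is ignored, and `response_seconds` is an illustrative helper:

```python
def response_seconds(tokens, tok_per_s):
    """Decode time for a response of the given length, ignoring prefill."""
    return tokens / tok_per_s


# Llama 3.3 70B Q4_K_M single-stream range from the table above.
LOW_TOK_S, HIGH_TOK_S = 20, 28
RESPONSE_TOKENS = 500  # hypothetical response length

slowest = response_seconds(RESPONSE_TOKENS, LOW_TOK_S)
fastest = response_seconds(RESPONSE_TOKENS, HIGH_TOK_S)
print(f"{RESPONSE_TOKENS} tokens: {fastest:.1f} s to {slowest:.1f} s")
```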

Not ideal for

Very large models at bf16, whether 100B+ dense or frontier-scale MoE (DeepSeek-V3, Kimi K2, Mistral Large 3 need a 256+ GB pool)
Long-form, full-resolution frontier video generation at bf16

Warranty and lead time

2 years parts warranty
1 year labor warranty
10-28 days lead time

The build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability and is confirmed at order.

Recommended add-ons

NVIDIA ConnectX-5 100 GbE MCX555A-ECAT
Upgrade boot drive to 2 TB or 4 TB NVMe
Upgrade RAM to 256 GB (4x 64 GB) — MoE KV cache headroom / multi-model concurrent serving
Rack PDU (C13/C19 metered) and 3 kVA online UPS