Kentino s.r.o.
K-AI 32 Rome 5090 1676TOPS — 1x RTX 5090 AI Workstation
K-AI 32 Rome 5090 1676TOPS
Single-GPU Blackwell Workstation
1x RTX 5090 | EPYC Milan | 1 676 TOPS INT8
Single Blackwell GPU, 32 GB GDDR7, fp8 native — the sharpest single-card AI workstation Kentino builds.
A single-GPU, workstation-class AI server on the ROMED8-2T / EPYC Milan platform. One RTX 5090 delivers 32 GB of GDDR7 VRAM with native fp8 tensor math, the sweet spot for a developer box, a small-team inference endpoint, or an image/video generation workstation where one strong GPU beats two weaker ones. The chassis is a 4U rack-mount unit, but it can also be deployed under a desk in a quiet office.
Hardware
| Component | Detail |
|---|---|
| GPU | 1x NVIDIA GeForce RTX 5090 32 GB GDDR7 (575 W, PCIe 5.0 x16, Blackwell) |
| VRAM pool | 32 GB |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 128 GB DDR4-2666 ECC RDIMM (2x 64 GB) |
| Boot / storage | 1 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | Single 2 kW ATX PSU |
| Chassis | 4U rack-mount, passive Gen4 x16 riser |
| Cooling | SP3 tower cooler (Arctic Freezer 4U-M class), 3x 120 mm front intake + 1x 120 mm rear exhaust |
| Network | Onboard dual 10 GbE (Intel X550) + IPMI |
Power envelope
- GPU draw: 1 x 575 W = 575 W
- System total at full load: ~900 W
- PSU total: 2 000 W (single 2 kW ATX) — 55 % headroom
- Generous transient margin, silent operation at light load
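The headroom figure above follows from simple arithmetic. A minimal Python sketch, using only the wattages quoted in this section (the function name is illustrative and not part of any Kentino tooling):

```python
def psu_headroom(psu_watts: float, load_watts: float) -> float:
    """Return unused PSU capacity as a fraction of the PSU rating."""
    if psu_watts <= 0:
        raise ValueError("psu_watts must be positive")
    return (psu_watts - load_watts) / psu_watts

GPU_WATTS = 1 * 575        # one RTX 5090 at its 575 W board power
SYSTEM_LOAD_WATTS = 900    # approximate full-load system draw quoted above
PSU_WATTS = 2000           # single 2 kW ATX PSU

headroom = psu_headroom(PSU_WATTS, SYSTEM_LOAD_WATTS)
non_gpu_watts = SYSTEM_LOAD_WATTS - GPU_WATTS  # CPU, RAM, drives, fans, losses
print(f"headroom: {headroom:.0%}, non-GPU budget: {non_gpu_watts} W")
```

The same function can be reused to check a configuration with add-on cards by raising `SYSTEM_LOAD_WATTS` by their rated draw.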
Lane topology
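The GPU runs at PCIe Gen4 x16: the ROMED8-2T is a Gen4 board, so the Gen5-capable 5090 negotiates down to Gen4. That halves peak host-to-GPU bandwidth relative to Gen5 x16 (roughly 32 GB/s versus 64 GB/s per direction), but inference with weights resident in VRAM rarely saturates the link, so the practical penalty is small. All 16 lanes come directly from the CPU root complex, with no PCIe switch in the path. The GeForce RTX 5090 has no NVLink.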
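To put the Gen4 versus Gen5 difference in numbers, the theoretical link bandwidth can be estimated from the per-lane signalling rate. The sketch below is an approximation under stated assumptions: it applies 128b/130b line encoding, ignores packet and protocol overhead, and the function name is illustrative.

```python
def pcie_gbytes_per_s(gen: int, lanes: int) -> float:
    """Approximate one-direction PCIe bandwidth in GB/s.

    Uses the per-lane signalling rate and 128b/130b line encoding
    (Gen3 to Gen5 only); packet and protocol overhead are ignored.
    """
    gt_per_s = {3: 8.0, 4: 16.0, 5: 32.0}[gen]  # giga-transfers/s per lane
    return gt_per_s * lanes * (128 / 130) / 8    # bits -> bytes

gen4 = pcie_gbytes_per_s(4, 16)
gen5 = pcie_gbytes_per_s(5, 16)
vram_gb = 32  # RTX 5090 VRAM pool

print(f"Gen4 x16: {gen4:.1f} GB/s, Gen5 x16: {gen5:.1f} GB/s")
print(f"Filling {vram_gb} GB of VRAM over Gen4 x16: ~{vram_gb / gen4:.1f} s")
```

By the same formula the PCIe 4.0 x4 link of the NVMe drive tops out near 7.9 GB/s, so cold model loads are more likely to be limited by storage than by the GPU slot.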
What you can run
With 32 GB of GDDR7 VRAM and native fp8 tensor math, this workstation handles open-weight LLMs up to 32B dense, image generation with FLUX.1, video generation, speech AI, and single-developer multi-model stacks.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3-32B dense Q6_K — 32k context, flagship general reasoning (~40-55 tok/s single-stream on Blackwell fp8, published reference)
- Qwen3-30B-A3B MoE at Q4_K_M with long KV headroom (Qwen3-Coder-30B-A3B agentic, 256k ctx)
- QwQ-32B Q6 — reasoning preview
- DeepSeek-R2 32B sparse MoE at Q4-Q6 — single-GPU reasoning that scores 92.7 % AIME-2025 (~45-60 tok/s single-stream on Blackwell fp8, published reference)
- Qwen3.5-27B dense Q6 (Feb 2026 release)
- Hunyuan-A13B at Q4_K_M (~28-30 GB) — 80B/13B MoE, 256k ctx, dual-mode reasoning
- Seed-OSS-36B Q4_K_M — 512k native context for long-doc analysis
Western frontier
- Llama 3.3 70B at Q2_K (~27 GB tight) or Q3_K (~34 GB with RAM spill) — usable for general chat
- Mistral Small 3 / Magistral Small / Devstral Small 2 (24B dense) at Q6-Q8 (bf16 weights alone are ~48 GB, so bf16 needs system-RAM offload)
- Gemma 3 27B multimodal at Q6 with 128k context
- Phi-4 14B / Phi-4-reasoning bf16
- Reka Flash 3 (21B, Apache 2.0) at bf16 with system-RAM offload (~42 GB of weights exceeds the 32 GB VRAM pool)
- gpt-oss-20b native MXFP4 (~16 GB — fits with generous KV)
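Whether a given model and quantization fits in 32 GB comes down to weight size plus KV cache. A rough sketch of that arithmetic follows. The function names are illustrative, the GQA layout in the second example is hypothetical rather than taken from any specific model, and real runtimes add overhead (activations, CUDA context, fragmentation) that this ignores.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in GB: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

VRAM_GB = 32

# A 14B dense model in bf16 (16 bits per weight), as listed for Phi-4 above.
w = weights_gb(14, 16)
# Hypothetical GQA layout with a 16k-token context and a 2-byte cache.
kv = kv_cache_gb(layers=40, kv_heads=8, head_dim=128, tokens=16_384)
print(f"weights {w:.1f} GB + KV {kv:.2f} GB = {w + kv:.2f} GB of {VRAM_GB} GB")

# Effective bits per weight implied by "70B in ~27 GB" (the Q2_K figure above).
implied_bpw = 27 * 8 / 70
print(f"implied bits/weight: {implied_bpw:.2f}")
```

For a real model, the layer count, KV head count and head dimension should be read from its published configuration instead of the placeholder values used here.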
Vision-Language
- Qwen3-VL-8B / -32B at Q4-Q6; Qwen3-VL-30B-A3B MoE
- InternVL3.5-8B / -38B Q4
- MiniCPM-V 2.6 / MiniCPM-o 2.6 (8B)
- Llama 3.2 11B Vision bf16
- Pixtral 12B bf16 (24 GB, tight; use Q8)
- Gemma 3 12B / 27B multimodal
- PaliGemma 2 (3/10B)
- Phi-4-multimodal 5.6B
- Aya Vision 8B
Image generation
- FLUX.1 [dev] / [schnell] fp8 (~12 GB), native Blackwell speedup (~8-12 seconds per 1024x1024 image at 20 steps on Blackwell, published reference)
- FLUX.1 Kontext [dev]: in-context editing, character consistency
- SD 3.5 Large (18 GB fp16 / 11 GB fp8)
- SDXL 1.0 (10-12 GB fp16)
- HunyuanImage-2.1 NF4 (~14 GB)
- Kolors 2.0 fp8
- AuraFlow v0.3 / OmniGen v1 / PixArt-Sigma
Video generation
- Wan 2.2 TI2V-5B at ~16 GB: 720p@24fps on a single 5090
- Wan 2.1 T2V/I2V 14B at Q4-Q6 (~16 GB)
- HunyuanVideo 1.5 (8.3B): 14 GB minimum
- CogVideoX-5B / 5B-I2V int8 (~12 GB)
- LTX-Video 2B: realtime-class, 30 fps
- Mochi-1 Q4 (~17-18 GB)
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo (~50x realtime on single GPU, published reference); NVIDIA Parakeet-TDT 1.1B; Canary 1B
- TTS: CosyVoice 2.0 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open
- Realtime / S2S: Kyutai Moshi (7B) — only open realtime full-duplex voice; Step-Audio 2 mini / R1
Multi-model / multi-tenant
- Resident stack for a single developer: Qwen3-32B Q4_K_M (~20 GB) + FLUX.1 fp8 (~12 GB), which only just fits and is better run with model swapping, or Qwen3-14B Q4_K_M (~9 GB) + FLUX.1 + Whisper-turbo + Kokoro resident simultaneously (~20-24 GB pinned)
- 2-4 concurrent users on 14-32B class LLMs via vLLM / SGLang
- LoRA / QLoRA fine-tuning of 7-14B dense models
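The stack sizes above can be checked against the 32 GB pool with a few lines of Python. This is a sketch only: the function name is illustrative, and `reserve_gb` is a hypothetical allowance for KV cache, activations and CUDA context, not a measured value.

```python
VRAM_GB = 32  # single RTX 5090, from the hardware table

def stack_fit(stack_gb: dict[str, float], reserve_gb: float = 0.0,
              vram_gb: float = VRAM_GB) -> tuple[float, bool]:
    """Return (total footprint in GB, whether it fits alongside the reserve)."""
    total = sum(stack_gb.values())
    return total, total + reserve_gb <= vram_gb

# Footprints quoted in the list above.
big_pair = {"32B-class LLM": 20, "FLUX.1 fp8": 12}
light_stack = {"four-model stack, upper bound": 24}

for name, stack in [("big pair", big_pair), ("light stack", light_stack)]:
    total, ok = stack_fit(stack, reserve_gb=2)
    print(f"{name}: {total} GB, fits with 2 GB reserve: {ok}")
```

With no reserve the larger pair exactly fills the 32 GB pool, which is why it is better treated as a swap configuration than as two models resident at once.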
Target workloads
- Developer workstation for a single AI engineer running mixed inference + image gen
- Small-team coding-agent endpoint (Qwen3-Coder-30B-A3B) with 1-4 concurrent users
- Content pipeline: FLUX.1 or SD 3.5 Large batch image gen + Wan 2.2 short-form video
- On-premises ASR + TTS voice stack (Whisper + Kokoro + Moshi) for a branch office
- Prosumer LLM + VLM research box — test Qwen3, Llama 3.3, Gemma 3, Phi-4 on real hardware
Published performance references
Published reference figures, comparable single RTX 5090 hardware:
| Benchmark | Result |
|---|---|
| Llama 3.3 70B Q4_K_M llama.cpp decode | ~18-22 tok/s with CPU KV offload |
| Qwen3-32B Q6 vLLM single-stream | ~45-55 tok/s decode at fp8 |
| FLUX.1 [dev] fp8 on Blackwell | ~1.7-2.0 s per 1024x1024 image at 20 steps |
| Wan 2.2 TI2V-5B 720p clip | ~3-4 minutes at fp16 |
These are published reference points from comparable single-5090 hardware, not Kentino measurements. Kentino's own measured numbers will be posted once gf-logic extends its benchmarking to the single-5090 configuration.
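To relate the decode rates in the table to waiting time, a tokens-per-second range can be converted into seconds per response. A minimal sketch, assuming a 1 000-token reply and counting decode only (prompt processing is excluded, and the function name is illustrative):

```python
def decode_seconds(tokens: int,
                   tok_per_s: tuple[float, float]) -> tuple[float, float]:
    """Return (best case, worst case) decode time for a tok/s range."""
    low, high = tok_per_s
    if low <= 0 or high < low:
        raise ValueError("expected 0 < low <= high")
    return tokens / high, tokens / low

REPLY_TOKENS = 1000  # assumed reply length, for illustration only

# Decode ranges taken from the reference table above.
references = {
    "Llama 3.3 70B Q4_K_M (llama.cpp)": (18, 22),
    "Qwen3-32B (vLLM, single-stream)": (45, 55),
}

for name, rate in references.items():
    best, worst = decode_seconds(REPLY_TOKENS, rate)
    print(f"{name}: {best:.1f}-{worst:.1f} s per {REPLY_TOKENS} tokens")
```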
Not ideal for
- 70B dense models at Q6+ (32 GB is insufficient; use a 2x 5090 build for a 64 GB pool)
- Multi-user concurrent serving at scale (one GPU, so no tensor-parallel scaling)
- Frontier 100B+ MoE (GLM-4.5, Kimi K2, Mistral Large 3 — out of reach on a single consumer card)
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- NVIDIA ConnectX-5 100 GbE MCX555A-ECAT
- Upgrade boot drive to 2 TB or 4 TB NVMe
- Upgrade RAM to 256 GB (4x 64 GB DDR4) for bigger KV cache / multi-model concurrent stacks
- Rack PDU (C13/C19 metered) and 2 kVA online UPS
