Kentino s.r.o.
K-AI 48 Rome 4090 1322TOPS — 2x RTX 4090 Entry AI Server
K-AI 48 Rome 4090 1322TOPS
48 GB VRAM Entry 2-GPU Server
2x RTX 4090 | EPYC Rome | 1 322 TOPS INT8
48 GB VRAM pool across two RTX 4090 — the cost-floor for 32B-class tensor-parallel inference.
A two-GPU, workstation-class Ada AI server built on the ASRock Rack ROMED8-2T and AMD EPYC Rome. Two RTX 4090 cards give a 48 GB pooled VRAM envelope that comfortably runs 32B dense models at Q6-Q8, Hunyuan-A13B at Q6, Wan 2.1 14B video, and Pixtral 12B vision. It offers the best all-round model selection per euro in the Kentino lineup before stepping up to Blackwell.
Hardware
| Component | Detail |
|---|---|
| GPUs | 2x NVIDIA GeForce RTX 4090 24 GB GDDR6X (450 W, PCIe 4.0 x16) |
| VRAM pool | 48 GB (no NVLink — tensor-parallel over PCIe) |
| CPU | AMD EPYC 7542 Rome (32C/64T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 128 GB DDR4-2666 ECC RDIMM (2x 64 GB) |
| Boot / storage | 1 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | Single 2 kW ATX PSU |
| Chassis | 4U rack-mount, passive Gen4 x16 risers |
| Cooling | SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust |
| Network | Onboard dual 10 GbE (Intel X550) + IPMI |
Power envelope
- GPU draw: 2 x 450 W = 900 W
- System total at full load: ~1 225 W
- PSU total: 2 000 W (single 2 kW ATX) — 38.75 % headroom
- Comfortable single-PSU margin
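The headroom figure follows directly from the numbers in the list above. A minimal sketch of the arithmetic, assuming headroom is measured as unused capacity divided by total PSU capacity (the function and variable names are illustrative, not part of any Kentino tooling):

```python
# Figures taken from the power envelope list above.
GPU_TDP_W = 450
GPU_COUNT = 2
SYSTEM_FULL_LOAD_W = 1225
PSU_CAPACITY_W = 2000


def psu_headroom(load_w: float, capacity_w: float) -> float:
    """Return unused PSU capacity as a fraction of total capacity."""
    return (capacity_w - load_w) / capacity_w


gpu_draw_w = GPU_TDP_W * GPU_COUNT
# Whatever is left of the full-load figure is CPU, RAM, storage, fans, and losses.
non_gpu_draw_w = SYSTEM_FULL_LOAD_W - gpu_draw_w
headroom = psu_headroom(SYSTEM_FULL_LOAD_W, PSU_CAPACITY_W)

print(f"GPU draw: {gpu_draw_w} W")
print(f"Non-GPU draw: {non_gpu_draw_w} W")
print(f"PSU headroom: {headroom:.2%}")
```

The same function can be reused to check whether an add-on card still leaves an acceptable margin: add its rated draw to the load figure and recompute.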
Lane topology
The ROMED8-2T routes a full PCIe Gen4 x16 link to each GPU directly from the CPU root complex, with no PLX switch in between. The consumer RTX 4090 has no NVLink, so tensor-parallel traffic between the two cards travels over PCIe.
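For a rough sense of what the PCIe path offers tensor-parallel traffic, the theoretical link bandwidth can be worked out from the PCIe 4.0 signalling rate (16 GT/s per lane, 128b/130b encoding). The sketch below is a back-of-envelope calculation only: it ignores protocol overhead, and the function name is illustrative.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Theoretical one-direction link bandwidth in GB/s, before protocol overhead."""
    # One transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return gt_per_s * lanes * encoding / 8


GPU_COUNT = 2
LANES_PER_GPU = 16
CPU_LANES = 128  # EPYC 7542, from the hardware table

gen4_x16 = pcie_bandwidth_gb_s(16.0, LANES_PER_GPU)
lanes_used = GPU_COUNT * LANES_PER_GPU

print(f"PCIe 4.0 x16: {gen4_x16:.1f} GB/s per direction")
print(f"GPU lanes used: {lanes_used} of {CPU_LANES}")
```

Real-world transfer rates land below this ceiling, so it is best read as an upper bound when estimating inter-GPU communication cost.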
What you can run
With 48 GB of pooled VRAM across 2 cards, this server handles 32B-class dense LLMs at Q6-Q8, MoE flagships, image and video generation, speech AI, and multi-tenant serving.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3-32B dense Q6-Q8 (~25-35 tok/s single-stream on 2x 4090, published reference); QwQ-32B Q6; Qwen3.5-27B Q6-Q8
- Qwen3-30B-A3B / Qwen3-Coder-30B-A3B Q6 (bf16 at ~60 GB exceeds the 48 GB pool)
- Hunyuan-A13B Q6 or fp8 (~48 GB) — 80B/13B MoE, 256k ctx
- Seed-OSS-36B Q6 — 512k native ctx
- DeepSeek-R2 32B sparse MoE Q6 (~45 GB); bf16 at ~64 GB exceeds the 48 GB pool (~30-40 tok/s single-stream at Q4, published reference)
- ERNIE-4.5-47B-A3B Q4 (~28 GB with headroom) / Q6 (~42 GB)
Western frontier
- Llama 3.3 70B Q4_K_M (~43 GB) tensor-parallel 2-way — the sweet spot of this class (~14-17 tok/s single-stream on 2x 4090, published reference)
- Llama 4 Scout 109B/17B MoE Q3_K (~51 GB tight)
- Mistral Small 3 / Magistral Small / Devstral Small 2 (24B) bf16
- Mixtral 8x7B Q6
- Gemma 3 27B bf16; Phi-4 14B bf16
- Nemotron-Super 49B Q4 (~28 GB)
- Others: OLMo 2 32B; Reka Flash 3 21B bf16; Falcon H1R 7B
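The size figures quoted in these lists can be sanity-checked with a simple rule of thumb: weight footprint is roughly parameter count times bits per weight, divided by eight. The sketch below applies it to a 70B dense model. The bits-per-weight values are approximate figures for llama.cpp quantization formats, and the 4 GB reserve for KV cache and activations is a placeholder assumption, not a measured number.

```python
POOL_GB = 48.0

# Approximate bits per weight (assumed values; actual files vary by model).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "bf16": 16.0,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, reserve_gb: float = 4.0) -> bool:
    """True if weights plus a fixed reserve fit in the pooled VRAM."""
    return weights_gb(params_billion, quant) + reserve_gb <= POOL_GB


for quant in ("Q4_K_M", "Q6_K"):
    size = weights_gb(70.0, quant)
    verdict = "fits" if fits(70.0, quant) else "does not fit"
    print(f"70B {quant}: ~{size:.1f} GB, {verdict} in {POOL_GB:.0f} GB")
```

Under these assumptions a 70B model fits at Q4_K_M but not at Q6, which matches the positioning of this server. Long contexts need a larger reserve than the placeholder used here.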
Vision-Language
Qwen3-VL-32B / Qwen3-VL-30B-A3B MoE / Qwen3-Omni-30B-A3B; InternVL3-38B Q4-Q5; InternVL3.5-38B; DeepSeek-VL2; ERNIE-4.5-VL-28B-A3B-Thinking; Llama 3.2 11B Vision bf16; Pixtral 12B bf16; Gemma 3 27B multimodal; PaliGemma 2 28B Q4; MiniCPM-V 2.6 / MiniCPM-o 2.6.
Image generation
FLUX.1 [dev] / [schnell] fp16 (24 GB) or fp8 (~12 GB) with generous batch (~15-25 seconds per 1024x1024 image at fp8 per card, published reference); FLUX.1 Kontext [dev]; SD 3.5 Large (18 GB fp16); SDXL 1.0 + ControlNet + AnimateDiff; HunyuanImage-2.1 bf16 (~34 GB fits in pool); AuraFlow v0.3 / OmniGen v1 / Kolors 2.0.
Video generation
Wan 2.1 14B T2V/I2V Q6/fp8; Wan 2.2 TI2V-5B bf16 single-card; Wan 2.2 T2V-A14B / I2V-A14B Q4 (~32 GB); HunyuanVideo 13B Q4-Q5 (~30 GB); HunyuanVideo 1.5 (8.3B) bf16; Open-Sora 2.0 (11B) Q8; CogVideoX-5B / 1.5 bf16; Mochi-1 Q4-Q8; LTX-Video 2B; Pyramid Flow 2B.
Audio / Speech / TTS
The full 24 GB tier stack fits with room for concurrent use: Whisper v3 large, Parakeet-TDT, Canary 1B, Moshi, Step-Audio 2 mini, CosyVoice 3.0, Kokoro 82M, and Stable Audio Open can all stay resident at the same time. Whisper v3 turbo runs at ~50x realtime on a single card (published reference).
Multi-model / multi-tenant
- 2-4 concurrent users on 32B Q6 class LLMs via vLLM tensor-parallel
- Mixed workload: Qwen3-32B Q6 (~20 GB) + FLUX.1 fp8 (~12 GB) + Whisper-turbo (1.6 GB) + Moshi (8 GB) resident across 2 cards
- LoRA / QLoRA fine-tuning of 7-14B models comfortably, 24-32B tight
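The mixed-workload example above can be checked with a simple placement pass. The sketch below uses a first-fit-decreasing heuristic with the footprints quoted in the list. It assumes each model sits whole on one card (no tensor-parallel sharding) and ignores KV cache and activation memory, so treat the result as a lower bound on real usage. The `place` helper is illustrative.

```python
CARD_VRAM_GB = 24.0
CARD_COUNT = 2

# Resident footprints in GB, as quoted in the mixed-workload bullet.
workloads = {
    "Qwen3-32B Q6": 20.0,
    "FLUX.1 fp8": 12.0,
    "Moshi": 8.0,
    "Whisper-turbo": 1.6,
}


def place(models: dict[str, float], card_count: int, card_vram_gb: float):
    """First-fit decreasing: largest model first, onto the first card with room."""
    free = [card_vram_gb] * card_count
    placement: list[list[str]] = [[] for _ in range(card_count)]
    for name, size in sorted(models.items(), key=lambda kv: kv[1], reverse=True):
        for card in range(card_count):
            if size <= free[card]:
                free[card] -= size
                placement[card].append(name)
                break
        else:
            raise ValueError(f"{name} ({size} GB) does not fit on any card")
    return placement, free


placement, free = place(workloads, CARD_COUNT, CARD_VRAM_GB)
for card, names in enumerate(placement):
    print(f"GPU {card}: {', '.join(names)} ({free[card]:.1f} GB free)")
```

With these inputs the LLM shares one card with Whisper-turbo while FLUX.1 and Moshi share the other, leaving a few GB free on each.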
Target workloads
- Two-operator AI workstation with mixed LLM + image + audio stacks
- 32B-class serving endpoint for small-team developer environment (4-8 concurrent users on Qwen3-32B / Gemma 3 27B)
- Image generation pipeline (FLUX.1 + SD 3.5 + ControlNet) batch production
- Video-gen development box (Wan 2.1 / Wan 2.2 TI2V / HunyuanVideo 1.5)
- LoRA / QLoRA fine-tuning research box for 7-34B Chinese + Western weights
Published performance references
Published reference figures for comparable 2x RTX 4090 hardware:
| Benchmark | Result |
|---|---|
| Llama 3.3 70B Q4_K_M llama.cpp decode | ~14-17 tok/s single-stream |
| Qwen3-32B Q6 vLLM single-stream | ~35-45 tok/s decode |
| FLUX.1 [dev] fp8 | ~2.5-3.0 s per 1024x1024 at 20 steps |
| vLLM batch-32 aggregate (extrapolated from 4x4090) | ~90 tok/s aggregate |
Published reference points from comparable 2x4090 hardware. Not measured on Kentino hardware.
Not ideal for
- 70B dense at Q6+ (needs 96 GB pool — step up to 4x RTX 4090 or 4x RTX 5090)
- Frontier 100B+ MoE at bf16 (GLM-4.5, Kimi K2, Mistral Large 3)
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- NVIDIA ConnectX-5 100 GbE MCX555A-ECAT
- Upgrade boot drive to 2 TB NVMe
- Upgrade RAM to 256 GB (4x 64 GB) — more KV cache headroom for long-ctx MoE
- Rack PDU (C13/C19 metered) and 2 kVA online UPS
