Kentino s.r.o.
K-AI 384 Rome RTXPro6000 — 4× RTX Pro 6000 Blackwell Server Edition (384 GB ECC VRAM)
K-AI 384 Rome RTXPro6000 8000TOPS
384 GB ECC VRAM Datacenter Server
4x RTX Pro 6000 Server Edition | EPYC Milan | 8 000 TOPS INT8
Published external references. Not measured on Kentino hardware.
A 4U rack-mount inference server with four NVIDIA RTX Pro 6000 Blackwell Server Edition passive datacenter cards (96 GB ECC each, 384 GB ECC VRAM in total), one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4-2666 ECC, a 2 TB NVMe boot drive, and two synchronized 2.5 kW ATX PSUs. The Blackwell GPUs accelerate fp8 natively. The cards are passively cooled and rely on the directed front-to-back airflow of a datacenter chassis. Runs DeepSeek V3 Q3, Mistral Large 3 Q3, Qwen3-Coder-480B Q3, and the other large open-weight models listed below.
Hardware
| Component | Detail |
|---|---|
| GPUs | 4x NVIDIA RTX Pro 6000 Blackwell Server Edition 96 GB ECC (passive datacenter cooler, 600 W TGP, PCIe 5.0 x16, 2000 INT8 TOPS/card, fp8 native) |
| VRAM pool | 384 GB aggregate ECC across 4 cards |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 384 GB DDR4-2666 ECC RDIMM (6x 64 GB — 2 DIMM slots open for upgrade to 512 GB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 2x 2.5 kW ATX with dual-PSU sync cable (5 kW aggregate) |
| Chassis | 4U rack-mount |
| Cooling | SP3 tower cooler (Arctic Freezer 4U-M class) + front-to-back directed airflow (3x 120 mm front intake + 1x 120 mm rear exhaust). Passive GPU cards — requires datacenter chassis airflow. |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 4 x 600 W = 2 400 W
- System total under full load: ~2 775 W
- PSU total: 5 000 W (dual 2.5 kW synced) — 44.5% headroom
- Power delivery is split across the two PSUs rather than mirrored: a single PSU failure takes down either 2 GPUs or 2 GPUs plus the motherboard, depending on which unit fails
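The budget above is simple arithmetic and can be checked with a short sketch. The constants are the figures quoted in this listing; the function and key names are illustrative, and the non-GPU share is only the difference between the quoted totals, not a measured value.

```python
# Power budget check using the figures quoted in the listing.
GPU_COUNT = 4
GPU_TGP_W = 600          # per-card TGP from the hardware table
SYSTEM_LOAD_W = 2775     # quoted full-load system estimate
PSU_COUNT = 2
PSU_RATING_W = 2500      # per-PSU rating


def power_budget():
    """Return GPU draw, PSU capacity, implied non-GPU draw, and headroom."""
    gpu_draw = GPU_COUNT * GPU_TGP_W
    psu_total = PSU_COUNT * PSU_RATING_W
    # Headroom is expressed as a share of total PSU capacity.
    headroom_pct = (psu_total - SYSTEM_LOAD_W) / psu_total * 100
    return {
        "gpu_draw_w": gpu_draw,
        "psu_total_w": psu_total,
        "non_gpu_w": SYSTEM_LOAD_W - gpu_draw,
        "headroom_pct": round(headroom_pct, 1),
    }


if __name__ == "__main__":
    for key, value in power_budget().items():
        print(f"{key}: {value}")
```

Note that the headroom figure applies to the combined 5 kW capacity; because the load is split between the PSUs, each unit's own load depends on which components are wired to it.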
Lane topology
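The ROMED8-2T exposes 7x PCIe 4.0 x16 slots wired directly to the EPYC Milan CPU. Four slots are populated, leaving three free for NIC, storage, or telemetry cards. The RTX Pro 6000 is Gen5-capable silicon but links at Gen4 x16 on this platform; with model weights resident in VRAM, this is not a bandwidth bottleneck for inference. There is no PCIe switch and no NVLink, so GPU-to-GPU traffic travels over PCIe through the CPU.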
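To put the Gen4 versus Gen5 difference in numbers, here is a rough sketch of the theoretical per-direction link rate of an x16 slot. It uses the standard per-lane signalling rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0) and 128b/130b line encoding, and it ignores packet and protocol overhead, so achievable throughput is lower. The function name is illustrative.

```python
# Theoretical per-direction PCIe link bandwidth after 128b/130b encoding.
ENCODING_EFFICIENCY = 128 / 130  # used by PCIe 3.0, 4.0 and 5.0


def pcie_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Raw link rate in GB/s for one direction, ignoring protocol overhead."""
    bits_per_s = gt_per_s * lanes * ENCODING_EFFICIENCY  # in Gbit/s
    return bits_per_s / 8


if __name__ == "__main__":
    gen4 = pcie_gb_per_s(16.0)  # PCIe 4.0: 16 GT/s per lane
    gen5 = pcie_gb_per_s(32.0)  # PCIe 5.0: 32 GT/s per lane
    print(f"Gen4 x16: {gen4:.1f} GB/s per direction")
    print(f"Gen5 x16: {gen5:.1f} GB/s per direction")
```

Either figure is small next to the on-card memory bandwidth quoted in the reference table, which is why the host link matters mainly for model loading, RAM spill, and GPU-to-GPU traffic rather than for decoding from resident weights.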
What you can run
With 384 GB of ECC VRAM across four Blackwell cards with native fp8, this server runs DeepSeek V3 / R1 at Q3 comfortably on-card, along with Mistral Large 3 Q3, GLM-5 Q3, and Qwen3-Coder-480B Q3 (tight, with RAM spill). Llama 3.3 70B fits on a single 96 GB card at fp8 (~70 GB); at bf16 the weights are ~140 GB and span two cards.
LLMs — text / reasoning / coding
Chinese frontier
- DeepSeek V3 / V3-0324 / V3.1 / V3.2 / R1 / R1-0528 Q3 (~290 GB) comfortably on-card (~30-40 tok/s single, published reference); fp8 native (~670 GB) with RAM spill
- Qwen3-Coder-480B-A35B Q3 (~350 GB tight with RAM spill) — SOTA open coding agent (~18-25 tok/s single, published reference)
- Qwen3-235B-A22B Q6/Q8 (~200-280 GB) with very long ctx and multi-user batching
- GLM-5 / GLM-5.1 Q3 (~317 GB) — Chinese frontier, close to Claude Opus 4.6 on coding
- Kimi-K2 1.58-bit UD (~240 GB) — trillion-param agent at real throughput
- Hunyuan-Large 389B/52B Q4 (~220 GB), fp8 native (~390 GB spill)
- ERNIE-4.5-424B-A47B Q4 (~240 GB); MiniMax-M1 Q4 (~260 GB) 1M-ctx
- Llama 3.3 70B at fp8 (~70 GB) resident on a single 96 GB card, no tensor parallelism needed (bf16 weights are ~140 GB and span two cards)
Western frontier
- Mistral Large 3 (675B/41B MoE, Apache 2.0) Q3 (~317 GB) — frontier Western open weights (~20-30 tok/s single, published reference)
- Llama 4 Maverick (400B/17B) Q4 (~232 GB) with generous KV budget (~45-55 tok/s single, published reference)
- Llama-3.1-Nemotron Ultra 253B Q4-Q6 (~119-207 GB)
- gpt-oss-120b MXFP4 native (80 GB) with massive concurrent fleet headroom
- Pixtral Large / Mistral Large 2 bf16 (~248 GB); Devstral 2 123B bf16, a leading open coding model with 256k context
- Llama 3.3 70B at fp8 (~70 GB) on a single card, 4 concurrent 70B deployments possible; bf16 (~140 GB) needs two cards per deployment
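The on-card versus RAM-spill distinctions in the lists above follow from comparing each quantized model size against the 384 GB VRAM total and the 384 GB of system RAM. A minimal sketch of that check is below. The model sizes are the approximate figures quoted in this listing; the 10% VRAM reserve for KV cache and runtime buffers is an assumption chosen for illustration, and the function name is hypothetical.

```python
# Rough placement check: does a quantized model fit on-card, or spill to RAM?
VRAM_GB = 4 * 96          # four 96 GB cards
SYSTEM_RAM_GB = 384       # stock configuration
VRAM_RESERVE = 0.10       # assumed share kept free for KV cache and buffers


def placement(model_gb: float) -> str:
    """Classify a model by where its weights would have to live."""
    usable_vram = VRAM_GB * (1 - VRAM_RESERVE)
    if model_gb <= usable_vram:
        return "on-card"
    if model_gb <= usable_vram + SYSTEM_RAM_GB:
        return "ram-spill"
    return "too-large"


if __name__ == "__main__":
    # Approximate sizes in GB as quoted in the listing.
    models = {
        "DeepSeek V3 Q3": 290,
        "Qwen3-Coder-480B Q3": 350,
        "DeepSeek V3 Q4": 404,
        "DeepSeek V3 fp8": 670,
    }
    for name, size in models.items():
        print(f"{name} (~{size} GB): {placement(size)}")
```

A real deployment would size the reserve from the intended context length and batch size rather than a fixed percentage.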
Vision-Language Models
Qwen3-VL-235B-A22B bf16 (~240 GB); InternVL3.5-241B-A28B Q4 (~135 GB); Llama 3.2 90B Vision bf16; Pixtral Large 124B bf16 (~248 GB); Qwen3-Omni-30B-A3B; Molmo 72B; ERNIE-4.5-VL; GLM-4.6V 106B bf16 with tensor parallelism. Blackwell fp8 delivers ~2x throughput on vision-tower inference vs Ada.
Image generation
FLUX.1 [dev] / Kontext / Tools at fp8 native (~15-20 s per 1024x1024 image on single RTX Pro 6000, published reference); SD 3.5 Large; HunyuanImage-2.1 (17B native 2K); HunyuanImage-3.0 80B/13B MoE; AuraFlow; OmniGen; 4x concurrent ComfyUI workers.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B dual expert bf16; HunyuanVideo 13B bf16 both experts; Open-Sora 2.0 (11B) bf16; CogVideoX-5B; Mochi-1; LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo; Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2/3; Kokoro; Stable Audio Open; XTTS v2; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi; Step-Audio 2 mini / R1; Qwen2.5-Omni-7B
- Music / SFX: MusicGen / AudioGen / Bark / SeamlessM4T
Multi-model / multi-tenant serving
- DeepSeek V3 Q3 + concurrent 70B + FLUX.1 + Whisper all resident
- 4-way tensor-parallel on 350-400B class at Q4
- Per-card tenant isolation: one Llama 3.3 70B at fp8 (~70 GB) per 96 GB card, 4 independent inference silos
- Multi-model RAG: reader + reranker + vision + embedder all on one host
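Per-card isolation is usually done by starting one server process per GPU and restricting each process to a single device with `CUDA_VISIBLE_DEVICES`. The sketch below only builds the launch plan and does not start anything. It assumes vLLM as the serving engine, fp8 weights so that a 70B model fits in 96 GB, and a Hugging Face model ID and port range picked for illustration; the function name is hypothetical.

```python
# Build one launch plan entry per GPU; nothing is executed here.
def tenant_launch_plan(model: str, n_gpus: int = 4, base_port: int = 8000):
    """Return an environment override and command line for each tenant."""
    plan = []
    for gpu in range(n_gpus):
        plan.append({
            # Each process sees exactly one physical card.
            "env": {"CUDA_VISIBLE_DEVICES": str(gpu)},
            # Assumed engine and flags: vLLM serving fp8 weights.
            "cmd": [
                "vllm", "serve", model,
                "--port", str(base_port + gpu),
                "--quantization", "fp8",
            ],
        })
    return plan


if __name__ == "__main__":
    # Model ID is an illustrative choice, not part of the listing.
    for tenant in tenant_launch_plan("meta-llama/Llama-3.3-70B-Instruct"):
        print(tenant["env"], " ".join(tenant["cmd"]))
```

To actually launch a tenant, each entry could be passed to `subprocess.Popen(cmd, env={**os.environ, **env})`. Device visibility keeps tenants off each other's GPU memory, but it is not a security boundary on its own; stronger separation would need containers or VMs.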
Target workloads
- Frontier open-weight inference backend — DeepSeek V3 Q3, Qwen3-Coder-480B Q3, GLM-5 Q3
- Production serving of Llama 4 Maverick Q4 multimodal agents with generous context budget
- 4-tenant per-card isolation: one Llama 3.3 70B at fp8 per tenant, with no GPU shared between tenants
- fp8-native DeepSeek / R1 / Hunyuan serving on Blackwell silicon
- Mistral Large 3 Q3 as Western Apache-2.0 frontier open-weight alternative
Published performance references
External references | Not measured on Kentino hardware
| Benchmark | Result |
|---|---|
| RTX Pro 6000 per-card INT8 TOPS | 2 000 TOPS |
| RTX Pro 6000 memory bandwidth | ~1 800 GB/s per card |
| vLLM — DeepSeek V3 Q3 on 4x Blackwell PCIe (single) | ~30-40 tok/s |
| vLLM — DeepSeek V3 Q3 on 4x Blackwell PCIe (batch-8) | ~200 tok/s aggregate |
| SGLang — Llama 4 Maverick Q4 on 4x Blackwell (single) | ~45-55 tok/s |
| llama.cpp — Qwen3-Coder-480B Q3 on 4x Blackwell (single) | ~18-25 tok/s |
| FLUX.1 [dev] fp8 on single RTX Pro 6000 | ~1.8 s per 1024x1024 image |
Kentino will publish first-party numbers after initial customer build.
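The two DeepSeek V3 Q3 rows in the table can be related with a little arithmetic: the batch-8 aggregate implies a per-stream rate, and comparing it with the single-stream range shows what batching buys. The sketch below uses only the published reference figures from the table; taking the midpoint of the single-stream range is a simplification, and the names are illustrative.

```python
# Relate the single-stream and batch-8 reference figures for DeepSeek V3 Q3.
SINGLE_TOK_S = (30, 40)   # published single-stream range
BATCH_SIZE = 8
BATCH_AGG_TOK_S = 200     # published batch-8 aggregate


def batch_summary():
    """Per-stream rate under batching and aggregate gain over one stream."""
    single_mid = sum(SINGLE_TOK_S) / 2
    per_stream = BATCH_AGG_TOK_S / BATCH_SIZE
    return {
        "per_stream_tok_s": per_stream,
        "aggregate_gain": BATCH_AGG_TOK_S / single_mid,
    }


if __name__ == "__main__":
    s = batch_summary()
    print(f"per-stream at batch 8: {s['per_stream_tok_s']:.0f} tok/s")
    print(f"aggregate gain vs single stream: {s['aggregate_gain']:.1f}x")
```

On these reference numbers, each user at batch 8 sees a somewhat lower rate than a lone user, while total throughput rises several-fold.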
Not ideal for
- Single-user workloads up to 70B — 4x RTX 5090 is materially cheaper for a 128 GB pool if ECC and passive reliability are not required
- Silent lab / office-adjacent deployment — passive cooler requires proper datacenter front-to-back airflow. For acoustic-sensitive sites choose the Max-Q turbofan variant (K-AI 384 Rome RTXPro6000MQ)
- Frontier training from scratch (no NVLink)
- Full DeepSeek V3 Q4 on-card (~404 GB) — upgrade to 6x RTX Pro 6000 / 576 GB
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in, memtest, and functional verification. Lead time depends on component availability and is confirmed at order.
Recommended add-ons
- Upgrade RAM to 512 GB DDR4 (add 2x 64 GB — 2 DIMM slots open) for RAM-spill headroom on Q3 frontier quants
- 4 TB NVMe Gen4 x4 for frontier-model library (DeepSeek V3 Q3 alone is ~290 GB on disk)
- Full 24U rack cabinet with managed PDU + online UPS
- Alternative configuration: Max-Q turbofan variant (K-AI 384 Rome RTXPro6000MQ), with the same silicon and a quieter blower cooler, for lab deployments
