Kentino s.r.o.
K-AI 384 Rome RTXPro6000MQ — 4× RTX Pro 6000 Blackwell Max-Q Turbofan (384 GB ECC VRAM)
K-AI 384 Rome RTXPro6000MQ — 4× RTX Pro 6000 Blackwell Max-Q Turbofan (384 GB ECC VRAM)
Δεν ήταν δυνατή η φόρτωση της διαθεσιμότητας παραλαβής
K-AI 384 Rome RTXPro6000MQ 8000TOPS
384 GB ECC VRAM Lab Server
4x RTX Pro 6000 Max-Q Turbofan | EPYC Milan | 8 000 TOPS INT8
Published external references. Not measured on Kentino hardware.
A 4U rack-mount inference server with four NVIDIA RTX Pro 6000 Blackwell Max-Q turbofan (blower) cards (96 GB ECC each) pooled to 384 GB ECC VRAM, one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4-2666 ECC, 2 TB NVMe boot, and dual synchronized 2.5 kW ATX PSU. Same Blackwell silicon as the Server Edition — identical inference envelope, identical throughput — with a quieter blower cooler suited to lab, R&D, and office-adjacent environments.
Hardware
| Component | Detail |
|---|---|
| GPUs | 4x NVIDIA RTX Pro 6000 Blackwell Max-Q 96 GB ECC (turbofan / blower cooler, 600 W TGP, PCIe 5.0 x16, 2000 INT8 TOPS/card, fp8 native) |
| VRAM pool | 384 GB aggregate ECC across 4 cards |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 384 GB DDR4-2666 ECC RDIMM (6x 64 GB — 2 DIMM slots open for upgrade to 512 GB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 2x 2.5 kW ATX with dual-PSU sync cable (5 kW aggregate) |
| Chassis | 4U rack-mount |
| Cooling | SP3 tower cooler (Arctic Freezer 4U-M class) + front-to-back directed airflow (3x 120 mm front intake + 1x 120 mm rear exhaust). GPU cards self-cooled via turbofan blower (rear exhaust) — quieter for lab environments. |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 4 x 600 W = 2 400 W
- System total under full load: ~2 775 W
- PSU total: 5 000 W (dual 2.5 kW synced) — 44.5% headroom
- Dual PSU for split power delivery — single PSU failure = loss of 2 GPUs or 2 GPUs + motherboard
Thermal profile (Max-Q)
Max-Q uses a turbofan (blower) cooler with directional rear-of-card exhaust. Expected GPU hotspot 72-80 C under continuous load. Materially quieter than passive cards in a high-static-pressure chassis. Better suited to non-datacenter airflow, open-rack, or lab / office-adjacent placement. Silicon, TDP, ECC, and performance are identical to the Server Edition.
What you can run
Identical to the Server Edition (K-AI 384 Rome RTXPro6000) — same Blackwell silicon, same 384 GB ECC pool, same fp8 native, same model compatibility. The difference is acoustic, not computational.
LLMs — text / reasoning / coding
Chinese frontier
- DeepSeek V3 / V3-0324 / V3.1 / V3.2 / R1 / R1-0528 Q3 (~290 GB) comfortably on-card (~30-40 tok/s single, published reference); fp8 native (~670 GB) with RAM spill
- Qwen3-Coder-480B-A35B Q3 (~350 GB tight with RAM spill) — SOTA open coding agent (~18-25 tok/s single, published reference)
- Qwen3-235B-A22B Q6/Q8 (~200-280 GB) with long ctx and multi-user batching
- GLM-5 / GLM-5.1 Q3 (~317 GB) — Chinese frontier, close to Claude Opus 4.6 on coding
- Kimi-K2 1.58-bit UD (~240 GB) — trillion-param agent at real throughput
- Hunyuan-Large 389B/52B Q4 (~220 GB), fp8 native (~390 GB spill)
- ERNIE-4.5-424B-A47B Q4 (~240 GB); MiniMax-M1 Q4 (~260 GB) 1M-ctx
- Llama 3.3 70B bf16 resident on a single card (96 GB/card)
Western frontier
- Mistral Large 3 (675B/41B MoE, Apache 2.0) Q3 (~317 GB) — frontier Western open weights (~20-30 tok/s single, published reference)
- Llama 4 Maverick (400B/17B) Q4 (~232 GB) with generous KV budget (~45-55 tok/s single, published reference)
- Llama-3.1-Nemotron Ultra 253B Q4-Q6 (~119-207 GB)
- gpt-oss-120b MXFP4 native (80 GB) with concurrent fleet headroom
- Pixtral Large / Mistral Large 2 bf16 (~248 GB); Devstral 2 123B bf16 — 256k top open coding
- Llama 3.3 70B bf16 on a single card; 4x concurrent 70B deployments possible
Vision-Language Models
Qwen3-VL-235B-A22B bf16 (~240 GB); InternVL3.5-241B-A28B Q4 (~135 GB); Llama 3.2 90B Vision bf16; Pixtral Large 124B bf16; Qwen3-Omni-30B-A3B; Molmo 72B; ERNIE-4.5-VL; GLM-4.6V 106B bf16 on TP. Blackwell fp8 delivers ~2x throughput on vision-tower inference vs Ada.
Image generation
FLUX.1 [dev] / Kontext / Tools at fp8 native (~15-20 s per 1024x1024 image on single RTX Pro 6000, published reference); SD 3.5 Large; HunyuanImage-2.1 (17B native 2K); HunyuanImage-3.0 80B/13B MoE; AuraFlow; OmniGen; 4x concurrent ComfyUI workers.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B dual-expert bf16; HunyuanVideo 13B bf16 both experts; Open-Sora 2.0 (11B) bf16; CogVideoX-5B; Mochi-1; LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo; Parakeet-TDT; Canary; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2/3; Kokoro; Stable Audio Open; XTTS v2; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi; Step-Audio 2 mini / R1; Qwen2.5-Omni-7B
- Music / SFX: MusicGen / AudioGen / Bark / SeamlessM4T
Multi-model / multi-tenant serving
- DeepSeek V3 Q3 + concurrent 70B + FLUX.1 + Whisper all resident
- 4-way tensor-parallel on 350-400B class at Q4
- Per-card tenant isolation — one 96 GB Llama 3.3 70B bf16 per card, 4 independent inference silos
- Multi-model RAG: reader + reranker + vision + embedder all on one host
Target workloads
- Frontier open-weight inference for a lab / R&D team where acoustic budget matters
- Small-team server room without dedicated datacenter airflow — Max-Q self-cooling tolerates open-rack placement
- Office-adjacent AI workstation for a specialist team (ML research, agentic tools)
- fp8-native serving (DeepSeek / R1 / Hunyuan) in lab settings
- 4-tenant per-card isolation workload with noise budget
Published performance references
External references | Same silicon as Server Edition | Not measured on Kentino hardware
| Benchmark | Result |
|---|---|
| RTX Pro 6000 per-card INT8 TOPS | 2 000 TOPS |
| RTX Pro 6000 memory bandwidth | ~1 800 GB/s per card |
| vLLM — DeepSeek V3 Q3 on 4x Blackwell PCIe (single) | ~30-40 tok/s |
| vLLM — DeepSeek V3 Q3 on 4x Blackwell PCIe (batch-8) | ~200 tok/s aggregate |
| SGLang — Llama 4 Maverick Q4 on 4x Blackwell (single) | ~45-55 tok/s |
| llama.cpp — Qwen3-Coder-480B Q3 on 4x Blackwell (single) | ~18-25 tok/s |
| FLUX.1 [dev] fp8 on single RTX Pro 6000 | ~1.8 s per 1024x1024 image |
Kentino will publish first-party numbers after initial customer build.
Not ideal for
- Proper datacenter rack deployments with established hot-aisle airflow — choose the passive Server Edition (K-AI 384 Rome RTXPro6000) instead: same silicon, simpler mechanically
- Single-user workloads up to 70B (4x RTX 5090 is materially cheaper for 128 GB pool)
- Frontier training from scratch (no NVLink)
- Full DeepSeek V3 Q4 on-card (~404 GB) — upgrade to 6x RTX Pro 6000 / 576 GB
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in, memtest, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- Upgrade RAM to 512 GB DDR4 (add 2x 64 GB — 2 DIMM slots open) for RAM-spill headroom on Q3 frontier quants
- 4 TB NVMe Gen4 x4 for frontier-model library (DeepSeek V3 Q3 alone is ~290 GB on disk)
- Full 24U rack cabinet with managed PDU + online UPS
- Alternative silhouette: passive Server Edition (K-AI 384 Rome RTXPro6000) — same silicon, for datacenter airflow deployments
Share
