Kentino s.r.o.
K-AI 576 Genoa RTXPro6000 12000TOPS — 6× RTX Pro 6000 Blackwell Server Edition AI Frontier Server
K-AI 576 Genoa RTXPro6000 12000TOPS
576 GB ECC VRAM Frontier Research Server
6x RTX Pro 6000 Server Edition | EPYC Genoa | 12 000 TOPS INT8
Published external references. Not measured on Kentino hardware.
A 7U rack-mount frontier-tier inference platform with six NVIDIA RTX Pro 6000 Blackwell Server Edition passive cards pooled to 576 GB ECC VRAM, one AMD EPYC 9354 Genoa CPU (32C/64T), 768 GB DDR5-4800 ECC (all 12 channels populated), 4 TB NVMe boot, and five 1200 W server PSUs. An on-board Broadcom PCIe Gen5 switch fans out uniformly to all 6 GPU slots. DeepSeek V3 Q4 (~404 GB) fits comfortably with long context, as do Kimi-K2 Q2 and Mistral Large 3 Q2-Q3: the full open-weight frontier, on-prem.
Hardware
| Component | Detail |
|---|---|
| GPUs | 6x NVIDIA RTX Pro 6000 Blackwell Server Edition 96 GB ECC (passive, 600 W, PCIe 5.0 x16, 2000 INT8 TOPS per card) |
| VRAM pool | 576 GB total across 6 cards (no NVLink — P2P over PCIe Gen5 at ~55-60 GB/s per direction) |
| CPU | AMD EPYC 9354 Genoa (32C/64T, 280 W, 128x PCIe 5.0 lanes, 12-channel DDR5) |
| Motherboard | ASRock Rack GENOAD8X-2T/BCM (SP5 Genoa, integrated Broadcom PEX PCIe Gen5 switch, 12x DDR5, 2x 10 GbE, IPMI) |
| System RAM | 768 GB DDR5-4800 ECC RDIMM (12x 64 GB — all channels populated, ~460 GB/s aggregate) |
| Boot / storage | 4 TB NVMe M.2 (PCIe 4.0 x4) — sized for frontier checkpoint staging |
| Power supply | 5x 1200 W server PSU set (HP-compatible, 6 kW total) |
| Chassis | 7U 8-GPU rack-mount, 10 PCIe slot capacity, active Gen5 risers |
| Cooling | SP5 Genoa tower cooler, 8x 120 mm chassis fans, front-to-back datacenter airflow required. Passive GPU cards. |
| Network | Onboard dual 10 GbE (Intel X550) |
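The pooled figures quoted in this listing (576 GB VRAM, 12 000 TOPS INT8, 768 GB RAM, ~460 GB/s memory bandwidth) follow from the per-component values in the table. The short Python sketch below reproduces that arithmetic. The constant and key names are illustrative, and the bandwidth is a theoretical peak that assumes one DIMM per channel and a 64-bit data path per channel, not a measured value.

```python
# Values copied from the hardware table; names are illustrative only.
GPU_COUNT = 6
VRAM_PER_GPU_GB = 96
INT8_TOPS_PER_GPU = 2000
DIMM_COUNT = 12            # assumption: one DIMM per memory channel
DIMM_SIZE_GB = 64
DDR5_MEGATRANSFERS = 4800  # DDR5-4800, per channel
BYTES_PER_TRANSFER = 8     # assumption: 64-bit data path per channel, ECC bits not counted


def aggregate_specs() -> dict:
    """Return the pooled totals implied by the per-component figures."""
    return {
        "vram_gb": GPU_COUNT * VRAM_PER_GPU_GB,
        "int8_tops": GPU_COUNT * INT8_TOPS_PER_GPU,
        "ram_gb": DIMM_COUNT * DIMM_SIZE_GB,
        # Theoretical peak: channels x MT/s x bytes per transfer, converted to GB/s.
        "ram_bandwidth_gb_s": DIMM_COUNT * DDR5_MEGATRANSFERS * BYTES_PER_TRANSFER / 1000,
    }


if __name__ == "__main__":
    for name, value in aggregate_specs().items():
        print(f"{name}: {value}")
```

Sustained memory bandwidth in practice is typically below the theoretical peak.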
Power envelope
- GPU draw: 6 x 600 W = 3 600 W
- System total at full load: ~4 080 W
- PSU total: 6 000 W (5x 1200 W) — 32% headroom
- No power-cap required for steady-state inference
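The headroom figure can be checked with a few lines of Python. This is a sketch using only the wattages listed above; the names are illustrative. The "one PSU offline" value is a hypothetical extra check, since the listing does not state whether the PSU set is configured for redundancy.

```python
# Wattages copied from the power envelope list; names are illustrative only.
GPU_COUNT = 6
GPU_WATTS = 600
SYSTEM_LOAD_WATTS = 4080   # "~4 080 W" full-load figure from the listing
PSU_COUNT = 5
PSU_WATTS = 1200


def power_budget() -> dict:
    """Derive GPU draw, PSU capacity, and headroom from the listed figures."""
    gpu_draw = GPU_COUNT * GPU_WATTS
    psu_total = PSU_COUNT * PSU_WATTS
    return {
        "gpu_draw_w": gpu_draw,
        # Everything that is not a GPU: CPU, RAM, fans, storage, switch.
        "non_gpu_draw_w": SYSTEM_LOAD_WATTS - gpu_draw,
        "psu_total_w": psu_total,
        "headroom_pct": 100 * (psu_total - SYSTEM_LOAD_WATTS) / psu_total,
        # Hypothetical: capacity remaining if one PSU is offline (redundancy not stated).
        "capacity_one_psu_down_w": (PSU_COUNT - 1) * PSU_WATTS,
    }


if __name__ == "__main__":
    for name, value in power_budget().items():
        print(f"{name}: {value}")
```

With these inputs the headroom comes out at 32% of PSU capacity, and the non-GPU share of the full-load figure is 480 W.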
Lane topology
The GENOAD8X-2T/BCM integrates a Broadcom PEX PCIe Gen5 switch on-board. The EPYC Genoa root complex exposes 128 Gen5 lanes and sits upstream of the switch, which fans out uniformly to all 6 GPU slots at Gen5 x16 end-to-end via active risers. This is a clean single-root topology, with simpler NUMA tuning than a dual-socket system. There is no NVLink; GPU-to-GPU P2P runs over PCIe at ~55-60 GB/s per direction.
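For context on the ~55-60 GB/s P2P figure, the theoretical per-direction bandwidth of a PCIe Gen5 x16 link can be derived from the 32 GT/s lane rate and 128b/130b encoding. The Python sketch below does that and expresses the quoted range as a fraction of the theoretical ceiling. The function names are illustrative, and protocol overhead beyond line encoding is not modelled.

```python
GEN5_GT_PER_S = 32.0             # PCIe 5.0 raw signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def pcie_gen5_gb_per_s(lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth of a Gen5 link, in GB/s."""
    return GEN5_GT_PER_S * lanes * ENCODING_EFFICIENCY / 8


def link_efficiency(achieved_gb_per_s: float, lanes: int = 16) -> float:
    """Fraction of the theoretical ceiling that a given throughput represents."""
    return achieved_gb_per_s / pcie_gen5_gb_per_s(lanes)


if __name__ == "__main__":
    print(f"Gen5 x16 ceiling: {pcie_gen5_gb_per_s():.1f} GB/s per direction")
    for quoted in (55, 60):  # range quoted in the listing
        print(f"{quoted} GB/s = {link_efficiency(quoted):.0%} of ceiling")
```

The ceiling works out to roughly 63 GB/s per direction, so the quoted range sits at about 87-95% of what the link can carry.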
What you can run
With 576 GB of pooled ECC VRAM on Blackwell fp8 native silicon, this server runs the full Chinese + Western open-weight frontier at research-grade quants: DeepSeek V3 Q4 (~404 GB) with long context, Kimi-K2 Q2, Mistral Large 3 Q2-Q3, GLM-5 Q2, Qwen3-Coder-480B Q4.
LLMs — text / reasoning / coding
Chinese frontier
- DeepSeek V3 / R1 / V3.1 / V3.2 at Q4_K_M (~404 GB) comfortable with long context (~5-8 tok/s single vLLM TP-6, published reference); fp8 native (~670 GB) with RAM spill
- Kimi-K2 (Base / Instruct / Thinking) at Q2_K (~375 GB) comfortable (~5-8 tok/s single, published reference)
- GLM-5 / GLM-5.1 (~745B/44B) at Q2_K (~260 GB) comfortable; Q3 (~420 GB) with RAM spill
- Qwen3-Coder-480B-A35B at Q4_K_M (~270 GB) with long context
- Qwen3-235B-A22B at bf16 (~470 GB) or fp8 (~240 GB)
- ERNIE-4.5-424B-A47B at Q4 (~240 GB) with full 128k ctx
- Intern-S1-Pro (1T/22B active, SAGE) at Q2_K (~325 GB) comfortable
- Hunyuan-Large A52B at Q4 (~220 GB); MiniMax-M1 at Q4 (~260 GB)
Western frontier
- Mistral Large 3 (675B/41B MoE, Apache 2.0) at Q2-Q3 (~243-317 GB) comfortable (~20-30 tok/s single, published reference)
- Llama 4 Maverick (400B/17B) at Q4_K_M (~232 GB) with long ctx (~45-55 tok/s single, published reference)
- Llama-3.1-Nemotron Ultra 253B at fp8 (~253 GB) or bf16 with RAM spill
- Grok-1 314B at Q4 (~182 GB); Snowflake Arctic at Q4 (~278 GB)
- DBRX Instruct 132B/36B at bf16 (~264 GB) or fp8 multi-instance
- All 70-120B class models at bf16 with room to spare
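A first-pass way to read the footprints above is to subtract each one from the 576 GB pool: a positive remainder is what is left for KV cache and activations, and a negative one is the amount that has to spill to system RAM. The sketch below uses three footprints from the lists; the function names are illustrative. It is a simplification that treats the six cards as one flat pool and ignores per-GPU sharding and runtime overhead, so real usable capacity will be somewhat lower.

```python
POOL_GB = 576  # 6 x 96 GB, treated as one flat pool (simplifying assumption)

# Approximate weight footprints copied from the model lists.
FOOTPRINTS_GB = {
    "DeepSeek V3 Q4_K_M": 404,
    "DeepSeek V3 fp8": 670,
    "Kimi-K2 Q2_K": 375,
}


def vram_headroom_gb(footprint_gb: int, pool_gb: int = POOL_GB) -> int:
    """VRAM left after loading weights; negative means spill to system RAM."""
    return pool_gb - footprint_gb


def placement(footprint_gb: int, pool_gb: int = POOL_GB) -> str:
    """Describe whether the weights fit in the pool or how much must spill."""
    headroom = vram_headroom_gb(footprint_gb, pool_gb)
    if headroom >= 0:
        return f"fits, {headroom} GB left for KV cache and activations"
    return f"spills {-headroom} GB to system RAM"


if __name__ == "__main__":
    for model, size_gb in FOOTPRINTS_GB.items():
        print(f"{model} (~{size_gb} GB): {placement(size_gb)}")
```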
Vision-Language Models
Qwen3-VL-235B-A22B flagship VLM; InternVL3.5-241B-A28B Q4 (~135 GB); GLM-4.5V / 4.6V 106B bf16 (~210 GB); Llama 3.2 90B Vision bf16; Pixtral Large 124B fp8; Molmo 72B bf16.
Image generation
HunyuanImage-3.0 Instruct tier (3x 80 GB) — fits with headroom; FLUX.1 [dev] / [schnell] / Kontext multi-instance (~15-20 s per 1024x1024 image on single RTX Pro 6000 fp8, published reference); SD 3.5 Large; SDXL; AuraFlow; OmniGen; HunyuanImage-2.1; Kolors 2.0.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B dual-expert MoE bf16 (~54 GB); HunyuanVideo 13B bf16 comfortable; Open-Sora 2.0 (11B) bf16; Mochi-1 (10B) fp16; NVIDIA Cosmos Predict 2 up to 14B; CogVideoX-5B; LTX-Video; Pyramid Flow.
Audio / Speech / TTS
Full stack resident concurrently: Whisper v3 large, Parakeet-TDT 1.1B, Canary 1B, Moshi 7B realtime, Qwen3-Omni, Step-Audio R1, CosyVoice 3.0, Kokoro, Stable Audio Open.
Multi-model / multi-tenant serving
- DeepSeek V3 Q4 inference + FLUX image + HunyuanVideo + Whisper/Moshi realtime voice all resident simultaneously
- Concurrent 70B tensor-parallel + 235B-MoE on separate PCIe domains via the Broadcom switch
- Research A/B evaluation: 3 frontier open-weight models resident concurrently
Target workloads
- Frontier open-weight research lab — on-prem access to DeepSeek V3 / Kimi-K2 / Mistral Large 3 class without cloud egress
- Sovereign AI deployment — EU data residency with an Apache 2.0 / MIT model stack
- Enterprise multi-model RAG + agentic platform — several 200-400B MoE models resident
- Model evaluation / safety research comparing frontier Chinese vs Western open weights
- Inference-at-scale for regulated industries requiring air-gap + ECC + PCIe Gen5
Published performance references
External references | Not measured on Kentino hardware
| Benchmark | Result |
|---|---|
| RTX Pro 6000 per-card INT8 TOPS | 2 000 TOPS |
| vLLM — DeepSeek V3 Q4 on 6x RTX Pro 6000 (single) | ~25-40 tok/s |
| vLLM — DeepSeek V3 Q4 on 6x RTX Pro 6000 (batch-32) | 200-400 tok/s aggregate |
| FLUX.1 [dev] fp8 on single RTX Pro 6000 | ~15-20 s per 1024x1024 image |
Exact figures are confirmed at the PoC stage. Kentino will publish first-party numbers after the initial customer build.
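The batch-32 row reports aggregate throughput across all concurrent requests. To estimate what each request sees, divide by the batch size. The sketch below does this under the assumption that throughput is shared evenly across streams, which real schedulers only approximate; the function name is illustrative.

```python
def per_stream_tok_s(aggregate_tok_s: float, batch_size: int) -> float:
    """Average tokens per second per request, assuming an even split."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return aggregate_tok_s / batch_size


BATCH_SIZE = 32
AGGREGATE_RANGE_TOK_S = (200, 400)  # batch-32 range from the table

if __name__ == "__main__":
    low, high = (per_stream_tok_s(v, BATCH_SIZE) for v in AGGREGATE_RANGE_TOK_S)
    print(f"Per-stream at batch {BATCH_SIZE}: {low}-{high} tok/s")
```

Under that assumption, the 200-400 tok/s aggregate corresponds to about 6.25-12.5 tok/s per request at batch 32.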
Not ideal for
- Kimi-K2 / DeepSeek V3 at Q4 real-speed production serving — step up to the 768 GB Turin dual
- Training from scratch on frontier-class models — no NVLink, PCIe P2P only
- Plug-and-play deployment — frontier MoE serving needs a skilled MLOps team
Warranty and lead time
Build includes assembly, BIOS configuration, driver installation, burn-in, memtest, functional verification, and LLM environment setup (vLLM / SGLang / llama.cpp / CUDA 13 stack with fp8 Blackwell kernels). Lead time depends on component availability and is confirmed at order.
Recommended add-ons
- NVIDIA ConnectX-5 MCX555A-ECAT 100 GbE NIC for multi-node scale-out
- Second 4 TB NVMe for dataset / model library
- Full 24U rack cabinet with front perforated door
- Online UPS 10 kVA
- Managed PDU
