Kentino s.r.o.
K-AI 192 Rome RTXPro6000 4000TOPS — 2× RTX Pro 6000 Blackwell Server Edition — EPYC Milan
K-AI 192 Rome RTXPro6000 4000TOPS
192 GB ECC Blackwell Flagship Pair
2x RTX Pro 6000 Server Edition | EPYC Milan | 4 000 TOPS INT8
Two passive RTX Pro 6000 Blackwell Server Edition cards — 96 GB ECC each. Less tensor-parallel overhead than 4- or 8-card builds. Datacenter flagship pair.
A 4U rack-mount inference server with two passive RTX Pro 6000 Blackwell Server Edition cards (96 GB ECC GDDR7 per card), one AMD EPYC 7643 Milan CPU (48C/96T), 256 GB DDR4 ECC, 2 TB NVMe boot, and a single 2 kW ATX PSU. For dense 70B at bf16 and mid-size MoE, fewer big cards beat more small cards: two-way tensor parallelism keeps communication overhead low, and each 96 GB card holds a complete copy of any model whose weights and KV cache fit in 96 GB (for example a 70B model at fp8, ~70 GB).
Hardware
| Component | Detail |
|---|---|
| GPUs | 2x NVIDIA RTX Pro 6000 Blackwell Server Edition 96 GB ECC GDDR7 (passive, 600 W, PCIe 5.0 x16, dual-slot) |
| VRAM pool | 192 GB ECC (96 GB x 2). A 70B model at bf16 (~140 GB of weights) spans both cards; at fp8 (~70 GB) it fits on one card |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 256 GB DDR4-2666 ECC RDIMM (4x 64 GB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 1x 2 kW ATX PSU |
| Chassis | 4U rack-mount with front-to-back directed airflow |
| Cooling | Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 2 x 600 W = 1 200 W
- System total at full load: ~1 525 W
- PSU total: 2 000 W (single 2 kW) — 23.7 % headroom
- Single PSU sufficient; optional dual-PSU upgrade for N+1 redundancy
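The headroom figure follows directly from the numbers listed above. A minimal sketch of the arithmetic, using only the quoted wattages; the constant and function names are illustrative, not part of any Kentino tooling:

```python
# Figures taken from the power envelope above; names are illustrative.
GPU_COUNT = 2
GPU_TDP_W = 600
SYSTEM_FULL_LOAD_W = 1525  # quoted system total, GPUs included
PSU_CAPACITY_W = 2000


def headroom_percent(load_w: float, capacity_w: float) -> float:
    """Return unused PSU capacity as a percentage of total capacity."""
    if capacity_w <= 0:
        raise ValueError("capacity_w must be positive")
    return (capacity_w - load_w) / capacity_w * 100.0


gpu_draw_w = GPU_COUNT * GPU_TDP_W
non_gpu_draw_w = SYSTEM_FULL_LOAD_W - gpu_draw_w  # CPU, RAM, storage, fans

print(f"GPU draw: {gpu_draw_w} W")
print(f"Rest of system: {non_gpu_draw_w} W")
print(f"PSU headroom: {headroom_percent(SYSTEM_FULL_LOAD_W, PSU_CAPACITY_W):.2f} %")
```

The same function can be reused to check a planned upgrade by swapping in the new load and PSU capacity.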
Lane topology
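PCIe Gen4 x16 per GPU (the card is Gen5 native; the ROMED8-2T board caps the link at Gen4). Each GPU connects directly to the CPU root complex with no PCIe switch. There is no NVLink, so inter-GPU traffic uses PCIe peer-to-peer. Five x16 slots remain open for expansion. The Gen4 versus Gen5 difference is negligible for inference at this VRAM density, because weights stay resident on the cards.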
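To put the Gen4 versus Gen5 point in numbers, here is a rough sketch of theoretical link bandwidth and the best-case time to fill one card from host memory. It assumes the standard per-lane rates (16 GT/s for Gen4, 32 GT/s for Gen5) and 128b/130b encoding; real transfers land below these figures because of protocol overhead, and the function names are illustrative.

```python
# Per-lane signalling rates in GT/s for PCIe 4.0 and 5.0.
GT_PER_S = {4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line coding


def pcie_gb_per_s(gen: int, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * lanes * ENCODING_EFFICIENCY / 8


def load_seconds(payload_gb: float, gen: int, lanes: int = 16) -> float:
    """Best-case time to copy payload_gb from host memory to one GPU."""
    return payload_gb / pcie_gb_per_s(gen, lanes)


for gen in (4, 5):
    print(
        f"Gen{gen} x16: {pcie_gb_per_s(gen):.1f} GB/s, "
        f"96 GB fill in {load_seconds(96, gen):.1f} s"
    )
```

The link speed mainly affects the one-time weight load and any per-token peer-to-peer traffic under tensor parallelism, not reads from VRAM during decoding.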
What you can run
With 192 GB ECC VRAM on just two Blackwell cards with native fp8/fp4, this is the cleanest path to dense 70B at bf16 and mid-size MoE. Run dense 70B at bf16 (~140 GB) or a 200B-class MoE across both cards with low 2-way TP overhead, or run two independent 70B streams at fp8 (~70 GB each), one per card.
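A quick way to sanity-check whether a dense model fits is to multiply parameter count by bytes per parameter and compare against 96 GB (one card) and 192 GB (both). The sketch below does only that; it covers dense bf16/fp8/fp4 weights, leaves GGUF-style Q2-Q6 quantization out because those formats vary in bits per weight, and treats KV cache and activations as a caller-supplied `overhead_gb` (an assumed parameter, not a measured value).

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}
CARD_GB = 96
CARDS = 2


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def placement(params_billion: float, precision: str, overhead_gb: float = 0.0) -> str:
    """Classify where a model fits on this two-card, 192 GB configuration."""
    need = weights_gb(params_billion, precision) + overhead_gb
    if need <= CARD_GB:
        return "one card"
    if need <= CARD_GB * CARDS:
        return "both cards"
    return "exceeds pool"


for name, size_b, prec in [
    ("70B dense", 70, "bf16"),
    ("70B dense", 70, "fp8"),
    ("109B MoE", 109, "bf16"),
]:
    print(f"{name} {prec}: {weights_gb(size_b, prec):.0f} GB -> {placement(size_b, prec)}")
```

Passing a non-zero `overhead_gb` gives a more conservative answer for long-context or batched serving.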
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q4 (~132 GB) comfortable with long ctx (~15-25 tok/s single-stream across 2 cards); Qwen3-Coder-480B-A35B Q2 (~160 GB); Qwen3.5-122B-A10B fp8 (~75 GB); Qwen3-32B dense bf16 with huge KV; QwQ-32B bf16
- DeepSeek: DeepSeek-V3/R1 Q2 (~215 GB with small RAM spill) — Blackwell runs fp8 natively; DeepSeek-R2 32B bf16 two concurrent streams (one per card)
- GLM / Z.ai: GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB) — hero config at this tier; GLM-4.5-Air fp8 or bf16 with huge KV
- Tencent Hunyuan: Hunyuan-Large Q3 (~160 GB) — 389B MoE with 256k ctx; Hunyuan-A13B fp8 native (~80 GB) with huge KV
- Others: Baidu ERNIE-4.5-424B Q3 (~180 GB); InternVL3.5-241B-A28B Q4 (~135 GB); MiniMax-M1 Q3 (~180 GB)
Western frontier
- Meta Llama: Llama 3.3 70B bf16 (~140 GB) across both cards, or fp8 (~70 GB) on one card for two independent concurrent 70B streams (~20-30 tok/s per stream); Llama 4 Scout bf16 (~218 GB, exceeds the 192 GB pool and needs RAM spill); Llama 4 Maverick Q3 (~188 GB)
- Mistral: Mistral Large 2 / Pixtral Large / Devstral 2 123B Q6 (~88 GB) single-card or fp8 (~123 GB) across both (bf16 at ~246 GB exceeds the 192 GB pool); Mistral Small 3 multi-stream
- OpenAI (open weights): gpt-oss-120b MXFP4 native (80 GB) — fits on ONE card, two independent concurrent streams
- NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q4 (~147 GB); Super 49B bf16 (~98 GB) across both cards, or fp8 on a single card
- Others: Cohere Command R+ 104B Q6 (~85 GB) on one card; Google Gemma 3 27B bf16 multiple concurrent streams
Vision-Language Models
InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B bf16 single-card; Pixtral Large 124B fp8 or Q6 (bf16 at ~248 GB exceeds the 192 GB pool); Llama 3.2 90B Vision bf16 (~180 GB); Molmo 72B bf16 (~144 GB); GLM-4.6V 106B fp8; Gemma 3 27B multimodal x 2-3 concurrent streams.
Image generation
FLUX.1 [dev] bf16 multiple concurrent streams; FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large bf16 concurrent; HunyuanImage-2.1 bf16 (~34 GB) x 2-4 concurrent; HunyuanImage-3.0 base (80B MoE, 13B active) bf16 (~160 GB) across both cards; HunyuanDiT; Kolors / Kolors 2.0; AuraFlow; OmniGen v1; PixArt-Sigma.
Video generation
Wan 2.2 MoE dual-expert bf16 full context — fits on one card, two concurrent generation streams; Wan 2.2 TI2V-5B; HunyuanVideo 13B bf16 both experts; HunyuanVideo 1.5; CogVideoX-5B bf16; Open-Sora 2.0 11B bf16; Mochi-1 bf16 (~42 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2/3; Kokoro 82M; XTTS v2; Stable Audio Open; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
- Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2
Multi-model / multi-tenant serving
- Two independent 70B streams — one per card, simplest form of tenant isolation
- Dense 70B fp8 + supporting stack: LLM on card 1, image/video/audio on card 2
- 200B MoE across both cards — minimal tensor-parallel overhead (2-way split)
- fp8-native frontier — DeepSeek V3 family, Hunyuan-Large fp8 with Blackwell native paths
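The one-model-per-card layouts above are usually implemented by starting one server process per GPU and restricting each process to its card with the `CUDA_VISIBLE_DEVICES` environment variable. A minimal sketch of that pattern follows; `my-llm-server` and its `--port` flag are hypothetical placeholders for whatever inference server is actually deployed, and the launch call is left commented out.

```python
import os


def env_for_card(card_index: int, base_env=None) -> dict:
    """Copy an environment and restrict CUDA to a single physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(card_index)
    return env


def launch_plan(commands):
    """Pair each command with an environment pinned to one card, in order."""
    return [(cmd, env_for_card(i)) for i, cmd in enumerate(commands)]


# Hypothetical server binary and flags, for illustration only.
plan = launch_plan([
    ["my-llm-server", "--port", "8000"],
    ["my-llm-server", "--port", "8001"],
])

for cmd, env in plan:
    print(f"GPU {env['CUDA_VISIBLE_DEVICES']}: {' '.join(cmd)}")
    # subprocess.Popen(cmd, env=env) would start the process on that card.
```

Each process then sees its card as device 0, which keeps the two tenants from allocating memory on each other's GPU.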
Target workloads
- Dense 70B inference: bf16 tensor-parallel across both cards with low overhead, or one fp8 model per card for independent streams
- 100-150B MoE at Q4-Q6 (GLM-4.5-Air, Qwen3.5-122B-A10B, Hunyuan-A13B, Llama 4 Scout)
- FP8-native frontier inference (DeepSeek V3 family, Hunyuan, Llama 4) — Blackwell runs fp8 natively
- Image + video generation studio at bf16 (Wan 2.2 T2V-A14B, HunyuanVideo 13B, FLUX.1 [dev])
- Long-context document analysis (MiniMax-M1, Kimi-K2 1.58-bit UD with spill)
Measured performance
Published references: NVIDIA RTX Pro 6000 Blackwell Server Edition datasheet and community benchmarks.
| Benchmark | Result |
|---|---|
| Per-card INT8 TOPS (NVIDIA datasheet) | 2 000 TOPS |
| Aggregate INT8 TOPS (2 cards) | 4 000 TOPS |
| Memory bandwidth per card | ~1 800 GB/s, 96 GB ECC GDDR7 |
| Llama 3.3 70B bf16 per-card (community) | 15-25 tok/s single-stream, 60-90 tok/s batch |
| Dual-card tensor-parallel 70B (community) | ~30-45 tok/s single-stream expected |
| Blackwell fp8 native | DeepSeek-V3 fp8, Hunyuan-A13B fp8 run without bf16 upcast |
These figures come from published external references and were not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.
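Single-stream decode speed is usually limited by memory bandwidth, since each generated token reads the active weights once. As a rough sanity check on single-stream figures, here is a minimal sketch of that ceiling, assuming the ~1 800 GB/s per-card bandwidth quoted in the table. The function name and the 70 GB example footprint are illustrative, and the estimate ignores KV-cache reads, kernel efficiency, and inter-GPU communication, so real throughput sits below it.

```python
def decode_tokens_per_s_upper_bound(
    weights_read_gb: float,
    bandwidth_gb_per_s: float,
    cards: int = 1,
) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed.

    weights_read_gb is the weight data read per token (for MoE models,
    the active experts only). With tensor parallelism each card reads
    its own shard, so bandwidth is summed across cards.
    """
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_per_s * cards / weights_read_gb


# Example: a ~70 GB weight footprint resident on one card.
ceiling = decode_tokens_per_s_upper_bound(70, 1800)
print(f"Upper bound: {ceiling:.1f} tok/s")
```

Batched serving can exceed this per-stream ceiling in aggregate, because one pass over the weights serves several sequences at once.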
Not ideal for
- Very high-concurrency multi-tenant serving: 4x L40 or 6x L4 spreads requests across more cards
- Heavy KV cache at very long context — step up to K-AI 384 RTXPro6000 8000TOPS
- Training — Kentino does not sell H-class NVLink fabrics
- Budget inference at 192 GB pool — 8x RTX 4090 is cheaper (trading ECC and passive cooling for cost)
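The long-context caveat above comes down to KV-cache size, which grows linearly with context length and competes with weights for VRAM. A minimal sketch of the standard estimate (one key and one value vector per layer per token) follows. The example shape of 80 layers, 8 KV heads, and head dimension 128 is an assumption modelled on a Llama-3-style 70B with grouped-query attention; substitute the values from the model's own config.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes added per token: key + value for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2, sequences: int = 1) -> float:
    """Total KV-cache size in GB for a number of concurrent sequences."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return per_token * context_tokens * sequences / 1e9


# Assumed Llama-3-style 70B shape with a bf16 cache (2 bytes per value).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (8_192, 32_768, 131_072):
    size = kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ctx:>7} tokens: {size:.1f} GB per sequence")
```

Adding the result to the weight footprint shows how much of the 192 GB pool remains for concurrent sequences at a given context length.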
Warranty and lead time
NVIDIA OEM 3-year warranty on RTX Pro 6000 Server Edition, plus Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability and is confirmed at order.
Recommended add-ons
- Upgrade to dual 2 kW synced PSU for N+1 redundancy
- Upgrade RAM to 512 GB (4 DIMM slots open)
- 4 TB NVMe for large weight libraries and model staging
- Expand to 4-card configuration (K-AI 384 RTXPro6000 8000TOPS) — chassis has slot capacity
- 24U rack cabinet + online UPS 5 kVA
