Kentino s.r.o.
K-AI 96 Rome RTXPro6000 2000TOPS — Single-Card 96 GB Blackwell Workstation Server
K-AI 96 Rome RTXPro6000 2000TOPS
96 GB ECC Single-Card Workstation Server
1x RTX Pro 6000 Blackwell | EPYC Milan | 2 000 TOPS INT8
One card, 96 GB ECC VRAM, the entire Blackwell tensor pipeline. 70B dense at fp8 on a single GPU, with no tensor-parallel overhead.
A 4U rack-mount workstation server with a single NVIDIA RTX Pro 6000 Blackwell Workstation card (96 GB ECC GDDR7), one AMD EPYC 7643 Milan CPU (48C/96T), 256 GB DDR4 ECC, 2 TB NVMe boot, and one 2 kW ATX PSU with 54 % headroom. The simplest software path Kentino ships — no tensor-parallel config, no multi-GPU debugging. vLLM, SGLang, llama.cpp, ComfyUI run single-device and just work.
Hardware
| Component | Detail |
|---|---|
| GPU | 1x NVIDIA RTX Pro 6000 Blackwell Workstation 96 GB ECC GDDR7 (600 W, PCIe 5.0 x16) |
| VRAM | 96 GB ECC on a single card — no pooling, no tensor-parallel overhead |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 256 GB DDR4-2666 ECC RDIMM (4x 64 GB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 1x 2 kW ATX PSU |
| Chassis | 4U rack-mount (4-slot capacity, 1 populated — room to expand) |
| Cooling | Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 1 x 600 W = 600 W
- System total at full load: ~925 W
- PSU total: 2 000 W — 53.8 % headroom
- Single PSU and simple cabling, with generous margin for a single-card build
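The headroom figure follows directly from the numbers in this list. A minimal sketch of the arithmetic, using only the wattages quoted above (the variable and function names are illustrative, not part of any Kentino tooling):

```python
# Power budget sketch using the figures quoted in the listing.
GPU_WATTS = 600          # 1 x RTX Pro 6000 Blackwell
SYSTEM_LOAD_WATTS = 925  # approximate full-load system total
PSU_WATTS = 2000         # single ATX PSU


def headroom_fraction(load_watts: float, psu_watts: float) -> float:
    """Fraction of PSU capacity left unused at the given load."""
    if psu_watts <= 0:
        raise ValueError("psu_watts must be positive")
    return (psu_watts - load_watts) / psu_watts


headroom = headroom_fraction(SYSTEM_LOAD_WATTS, PSU_WATTS)
non_gpu_watts = SYSTEM_LOAD_WATTS - GPU_WATTS  # CPU, RAM, drives, fans

print(f"Headroom at full load: {headroom:.1%}")
print(f"Non-GPU share of the load: {non_gpu_watts} W")
```

The same function can be reused to check the margin if more cards are added to the open slots later; the per-card draw would need to be added to the load first.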
Lane topology
PCIe Gen4 x16 at the GPU: the card is Gen5 native, but the ROMED8-2T board caps the link at Gen4. The GPU connects directly to the root complex, with no PCIe switch in between. No NVLink is required, since a single card has no inter-GPU link at all. Six x16 slots remain open for NIC / storage / expansion.
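To put the Gen4 cap in perspective, the theoretical link bandwidth can be worked out from the per-lane signalling rate (16 GT/s for Gen4, 32 GT/s for Gen5) and the 128b/130b line coding. The sketch below is a best-case estimate only: real transfers come in lower because of protocol overhead, and the 70 GB payload is simply an example model size.

```python
# Per-lane raw signalling rate in GT/s for the two generations discussed.
RAW_GT_PER_S = {4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line coding


def pcie_gb_per_s(gen: int, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s."""
    return RAW_GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes / 8  # bits -> bytes


def transfer_seconds(size_gb: float, gen: int, lanes: int = 16) -> float:
    """Best-case time to move size_gb across the link."""
    return size_gb / pcie_gb_per_s(gen, lanes)


for gen in (4, 5):
    print(
        f"Gen{gen} x16: {pcie_gb_per_s(gen):.1f} GB/s, "
        f"70 GB of weights in about {transfer_seconds(70, gen):.1f} s at best"
    )
```

For single-card inference the link mainly affects model load time; once the weights are resident in VRAM, token generation does not stream them over PCIe.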
What you can run
With 96 GB of ECC VRAM on a single Blackwell card, this server handles 70B dense models at fp8 on one GPU, open-weight LLMs, vision models, image and video generation, speech AI, and production inference, with no tensor-parallel coordination needed.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3 / Qwen3.5 (Alibaba): Qwen3-32B dense bf16 (~65 GB) with generous KV; Qwen3-72B Q6 (~58 GB, ~25-35 tok/s single-stream); Qwen3-30B-A3B MoE bf16; Qwen3-Coder-30B-A3B agentic at 256k ctx; Qwen3.5-122B-A10B Q4 (~70 GB) with tight KV; QwQ-32B bf16 reasoning
- DeepSeek: DeepSeek-R2 32B sparse MoE bf16 (~64 GB, 92.7 % AIME 2025 single-card); DeepSeek-R1-Distill-Qwen-32B bf16; DeepSeek-V2-Lite 16B full precision
- GLM / Z.ai: GLM-4.5-Air 106B/12B Q4-Q5 (60-70 GB); GLM-4.6V 106B Q4
- Tencent Hunyuan: Hunyuan-A13B 80B/13B MoE Q4-fp8 (~48-80 GB) with 256k ctx and dual-mode reasoning
- ByteDance Seed-OSS-36B bf16 (~72 GB tight) or fp8 (~36 GB) with full 512k native context
- Baidu ERNIE-4.5-47B-A3B Q4-fp8 with long context
Western frontier
- Meta Llama: Llama 3.3 70B at fp8 (~70 GB; bf16 would need ~140 GB) on a single card with 8-16k ctx, the hero config; Llama 3.3 70B Q6 (~58 GB, ~35-50 tok/s single-stream); Llama 3.1 8B bf16 (~80-120 tok/s); Llama 3.2 90B Vision Q4 (~52 GB); Llama 4 Scout 109B/17B MoE Q4 (~63 GB)
- Mistral: Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) all at bf16 with 256k ctx; Mixtral 8x7B Q6; Codestral Mamba 7B; Pixtral 12B bf16
- OpenAI (open weights): gpt-oss-20b MXFP4 native (16 GB); gpt-oss-120b MXFP4 native (80 GB) — single-card single-stream
- Google Gemma 3: 27B multimodal bf16 (~54 GB) with 128k ctx; 12B / 4B bf16
- Microsoft Phi-4 14B dense bf16; Phi-4-reasoning; Phi-4-multimodal
- NVIDIA Nemotron: Llama-3.1-Nemotron-Super 49B Q6 (~40 GB); Nemotron-Nano 8B
- Others: IBM Granite 4.0 H-Small 32B/9B; OLMo 2 32B; Reka Flash 3 21B; Falcon H1R 7B; Command R 35B
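The sizes quoted in these lists follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. The sketch below applies that rule with nominal parameter counts. The `reserve_gb` value for KV cache, activations, and runtime overhead is an assumption, and real quantized formats such as Q4 or Q6 carry some extra bits per weight, so treat the output as a first check rather than a guarantee.

```python
VRAM_GB = 96.0

# Idealised storage cost per parameter; real quantized formats add overhead.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str,
         reserve_gb: float = 8.0, vram_gb: float = VRAM_GB) -> bool:
    """True if weights plus an assumed reserve fit in VRAM."""
    return weight_gb(params_billion, precision) + reserve_gb <= vram_gb


examples = [
    ("27B dense", 27, "bf16"),
    ("36B dense", 36, "bf16"),
    ("70B dense", 70, "fp8"),
    ("70B dense", 70, "bf16"),
]
for name, size_b, precision in examples:
    verdict = "fits" if fits(size_b, precision) else "does not fit"
    print(f"{name} {precision}: ~{weight_gb(size_b, precision):.0f} GB, {verdict}")
```

A 70B dense model is about 70 GB at one byte per weight, while two bytes per weight would need about 140 GB, which is more than a single 96 GB card holds.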
Vision-Language Models
Qwen3-VL-8B / 32B bf16, Qwen3-VL-30B-A3B MoE bf16, Qwen3-Omni-30B-A3B; InternVL3 up to 78B Q4 (~48 GB); InternVL3.5-38B bf16; DeepSeek-VL2 full range; Llama 3.2 11B Vision bf16; Llama 3.2 90B Vision Q4 (~52 GB); Pixtral 12B bf16; Molmo 72B Q4; Molmo 7B bf16; Gemma 3 12B / 27B multimodal; PaliGemma 2 28B; Phi-3.5-Vision; Aya Vision 8B / 32B; MiniCPM-V 2.6 / MiniCPM-o 2.6; GLM-4.6V.
Image generation
FLUX.1 [dev] / [schnell] bf16 (~24 GB) and quantized (~15-25 s/image at fp8); FLUX.1 Kontext [dev] in-context editing; FLUX Tools (Fill / Depth / Canny / Redux); SD 3.5 Large bf16 (~18 GB); SDXL 1.0; HunyuanImage-2.1 bf16 (~34 GB) at 2K native; HunyuanDiT 1.5B; Kolors / Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B MoE bf16 (~54 GB, both experts resident); Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B bf16 (~60-80 GB, tight at 720p); HunyuanVideo 1.5 (8.3B); CogVideoX-5B; Open-Sora 2.0 (11B) bf16; Genmo Mochi-1 bf16 (~42 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo (~50x realtime); NVIDIA Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open; Coqui XTTS v2; StyleTTS 2; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi (200 ms full-duplex); Step-Audio 2 mini; Step-Audio-R1 / R1.1; Qwen2.5-Omni-7B
- Music / SFX: Meta MusicGen; AudioGen; Suno Bark; SeamlessM4T v2
Multi-model / multi-tenant serving
- Single-tenant streaming coding assistant: 70B dense fp8, low latency, no TP penalty
- Mixed resident stack: Qwen3-32B bf16 + FLUX.1 fp8 + Whisper-turbo + Moshi on one card with partitioned VRAM
- Fine-tuning: LoRA / QLoRA on 13-34B models; full-param on 7B
- Embedding service: BGE-M3 / E5 / Jina resident alongside a generator LLM
Target workloads
- Single-tenant streaming coding assistant running Llama 3.3 70B fp8 or Qwen3-Coder-30B-A3B, with no TP coordination overhead
- Developer workstation for a single engineer or tight team needing a 70B-class model with 32-128k context
- Video or image generation lab — HunyuanVideo 13B, Wan 2.2 dual-expert, HunyuanImage-2.1 all at bf16 resident
- VLM / OCR bench — Qwen3-VL-32B bf16 or InternVL3.5-38B with long-document pipelines
- Clean appliance for a small LLM API gateway — one model, one card, easy ops
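How much context a given workload can hold depends on the VRAM left over for the KV cache once the weights are loaded. The sketch below estimates that for a grouped-query-attention model. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are the commonly published Llama 3 70B values and should be checked against the model's own config; the 70 GB weight size and 6 GB runtime reserve are assumptions for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_context_tokens(free_gb: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in free_gb (1 GB = 1e9 bytes)."""
    return int(free_gb * 1e9 // per_token_bytes)


VRAM_GB = 96
WEIGHTS_GB = 70   # assumed resident weight size
RESERVE_GB = 6    # assumed activations + runtime overhead

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
free_gb = VRAM_GB - WEIGHTS_GB - RESERVE_GB

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Free for KV cache: {free_gb} GB")
print(f"Approximate max context: {max_context_tokens(free_gb, per_token)} tokens")
```

Smaller weights (for example a 4-bit quantization) or an 8-bit KV cache (`bytes_per_elem=1`) shift the balance toward longer contexts.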
Reference performance
Published references: NVIDIA RTX Pro 6000 Blackwell datasheet + community benchmarks
| Benchmark | Result |
|---|---|
| Per-card INT8 TOPS (NVIDIA datasheet) | 2 000 TOPS |
| VRAM per card | 96 GB ECC GDDR7 |
| Memory bandwidth | ~1 800 GB/s |
| Llama 3.3 70B Q6 single-GPU (community) | 40-55 tok/s single-stream |
| Llama 3.3 70B fp8 single-GPU (community) | 15-25 tok/s single-stream |
| Blackwell fp8 native | DeepSeek-V3 fp8, Hunyuan-A13B fp8 run without bf16 upcast |
Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.
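A quick sanity check on single-stream figures is the memory-bandwidth ceiling: for a dense model, generating each token requires reading all resident weights once, so decode speed cannot exceed bandwidth divided by weight size. The sketch below is a rough upper bound only. It ignores KV-cache reads and kernel overhead, and it does not apply to batched serving, speculative decoding, or MoE models that read only their active experts.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tokens/s for a dense model."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 1800  # memory bandwidth from the table above
ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, weights_gb=70)
print(f"~70 GB of dense weights: at most about {ceiling:.1f} tok/s single-stream")
```

With about 70 GB of resident weights the ceiling is roughly 26 tok/s, so the 15-25 tok/s range quoted for that configuration sits just under the bound.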
Not ideal for
- Training large models from scratch (single GPU — no tensor/pipeline parallelism)
- Frontier 200B+ MoE at real quantizations (Qwen3-235B Q4, GLM-4.5/4.6 — use K-AI 192 RTXPro6000 or larger)
- High-concurrency multi-tenant inference (single card caps aggregate throughput; 4x RTX 4090 or 4x L40 scale better)
Warranty and lead time
NVIDIA OEM 3-year warranty on RTX Pro 6000 + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- Upgrade RAM to 512 GB (add 4x 64 GB DDR4 — four DIMM slots still open)
- 4 TB NVMe secondary drive for model library / dataset staging
- 24U open cabinet for production rack-mount
- For Gen5 x16 link speed consider the Genoa-platform variant on request
