Kentino s.r.o.

Kentino AI 96 Rome RTXPro6000 2000TOPS — Single-Card 96 GB Blackwell Workstation-Server

Name: Kentino AI 96 Rome RTXPro6000 2000TOPS — Single-Card 96 GB Blackwell Workstation-Server
Brand: Kentino s.r.o.
Price: 20900.00 EUR
Availability: InStock

€20.900,00 EUR

Sale Ausverkauft

Versand wird beim Checkout berechnet

Anzahl

Kentino AI 96 Rome RTXPro6000 2000TOPS

96 GB ECC Single-Card Workstation Server
1x RTX Pro 6000 Blackwell | EPYC Milan | 2 000 TOPS INT8

2 000

INT8 TOPS

96 GB

ECC VRAM

single

card design

fp8

native Blackwell

One card, 96 GB ECC VRAM, the entire Blackwell tensor pipeline. 70B dense bf16 on a single GPU — no tensor-parallel overhead.

A 4U rack-mount workstation server with a single NVIDIA RTX Pro 6000 Blackwell Workstation card (96 GB ECC GDDR7), one AMD EPYC 7643 Milan CPU (48C/96T), 256 GB DDR4 ECC, 2 TB NVMe boot, and one 2 kW ATX PSU with 54 % headroom. The simplest software path Kentino ships — no tensor-parallel config, no multi-GPU debugging. vLLM, SGLang, llama.cpp, ComfyUI run single-device and just work.

Hardware

Component	Detail
GPU	1x NVIDIA RTX Pro 6000 Blackwell Workstation 96 GB ECC GDDR7 (600 W, PCIe 5.0 x16)
VRAM	96 GB ECC on a single card — no pooling, no tensor-parallel overhead
CPU	AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard	ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM	256 GB DDR4-2666 ECC RDIMM (4x 64 GB)
Boot / storage	2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply	1x 2 kW ATX PSU
Chassis	4U rack-mount (4-slot capacity, 1 populated — room to expand)
Cooling	Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust
Network	Onboard dual 10 GbE (Intel X550)

Power envelope

GPU draw: 1 x 600 W = 600 W
System total at full load: ~925 W
PSU total: 2 000 W — 53.8 % headroom
Single PSU, simple cabling — generous margin for single-card build

Lane topology

PCIe Gen4 x16 at the GPU (card is Gen5 native; Rome board caps at Gen4). Direct root-complex connection — no PCIe switch. No NVLink required — single card, no inter-GPU link at all. Six x16 slots remain open for NIC / storage / expansion.

What you can run

With 96 GB of ECC VRAM on a single Blackwell card, this server handles 70B dense bf16 on one GPU, open-weight LLMs, vision models, image and video generation, speech AI, and production inference — no tensor-parallel coordination needed.

LLMs — text / reasoning / coding

Chinese frontier

Qwen3 / Qwen3.5 (Alibaba): Qwen3-32B dense bf16 (~65 GB) with generous KV; Qwen3-72B Q6 (~58 GB, ~25-35 tok/s single-stream); Qwen3-30B-A3B MoE bf16; Qwen3-Coder-30B-A3B agentic at 256k ctx; Qwen3.5-122B-A10B Q4 (~70 GB) with tight KV; QwQ-32B bf16 reasoning
DeepSeek: DeepSeek-R2 32B sparse MoE bf16 (~64 GB, 92.7 % AIME 2025 single-card); DeepSeek-R1-Distill-Qwen-32B bf16; DeepSeek-V2-Lite 16B full precision
GLM / Z.ai: GLM-4.5-Air 106B/12B Q4-Q5 (60-70 GB); GLM-4.6V 106B Q4
Tencent Hunyuan: Hunyuan-A13B 80B/13B MoE Q4-fp8 (~48-80 GB) with 256k ctx and dual-mode reasoning
ByteDance Seed-OSS-36B bf16 (~72 GB tight) or fp8 (~36 GB) with full 512k native context
Baidu ERNIE-4.5-47B-A3B Q4-fp8 with long context

Western frontier

Meta Llama: Llama 3.3 70B at bf16 (~70 GB) on a single card with 8-16k ctx — the hero config; Llama 3.3 70B Q6 (~58 GB, ~35-50 tok/s single-stream); Llama 3.1 8B bf16 (~80-120 tok/s); Llama 3.2 90B Vision Q4 (~52 GB); Llama 4 Scout 109B/17B MoE Q4 (~63 GB)
Mistral: Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) all at bf16 with 256k ctx; Mixtral 8x7B Q6; Codestral Mamba 7B; Pixtral 12B bf16
OpenAI (open weights): gpt-oss-20b MXFP4 native (16 GB); gpt-oss-120b MXFP4 native (80 GB) — single-card single-stream
Google Gemma 3: 27B multimodal bf16 (~54 GB) with 128k ctx; 12B / 4B bf16
Microsoft Phi-4 14B dense bf16; Phi-4-reasoning; Phi-4-multimodal
NVIDIA Nemotron: Llama-3.1-Nemotron-Super 49B Q6 (~40 GB); Nemotron-Nano 8B
Others: IBM Granite 4.0 H-Small 32B/9B; OLMo 2 32B; Reka Flash 3 21B; Falcon H1R 7B; Command R 35B

Vision-Language Models

Qwen3-VL-8B / 32B bf16, Qwen3-VL-30B-A3B MoE bf16, Qwen3-Omni-30B-A3B; InternVL3 up to 78B Q4 (~48 GB); InternVL3.5-38B bf16; DeepSeek-VL2 full range; Llama 3.2 11B Vision bf16; Llama 3.2 90B Vision Q4 (~52 GB); Pixtral 12B bf16; Molmo 72B Q4; Molmo 7B bf16; Gemma 3 12B / 27B multimodal; PaliGemma 2 28B; Phi-3.5-Vision; Aya Vision 8B / 32B; MiniCPM-V 2.6 / MiniCPM-o 2.6; GLM-4.6V.

Image generation

FLUX.1 [dev] / [schnell] bf16 (~24 GB) and quantized (~15-25 s/image at fp8); FLUX.1 Kontext [dev] in-context editing; FLUX Tools (Fill / Depth / Canny / Redux); SD 3.5 Large bf16 (~18 GB); SDXL 1.0; HunyuanImage-2.1 bf16 (~34 GB) at 2K native; HunyuanDiT 1.5B; Kolors / Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma.

Video generation

Wan 2.2 T2V-A14B / I2V-A14B MoE bf16 (~54 GB, both experts resident); Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B bf16 (~60-80 GB, tight at 720p); HunyuanVideo 1.5 (8.3B); CogVideoX-5B; Open-Sora 2.0 (11B) bf16; Genmo Mochi-1 bf16 (~42 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.

Audio / Speech / TTS

ASR: Whisper v3 large / turbo (~50x realtime); NVIDIA Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
TTS: CosyVoice 2 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open; Coqui XTTS v2; StyleTTS 2; Step-Audio-EditX
Realtime / S2S: Kyutai Moshi (200 ms full-duplex); Step-Audio 2 mini; Step-Audio-R1 / R1.1; Qwen2.5-Omni-7B
Music / SFX: Meta MusicGen; AudioGen; Suno Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

Single-tenant streaming coding assistant — 70B dense bf16, low latency, no TP penalty
Mixed resident stack: Qwen3-32B bf16 + FLUX.1 fp8 + Whisper-turbo + Moshi on one card with partitioned VRAM
Fine-tuning: LoRA / QLoRA on 13-34B models; full-param on 7B
Embedding service: BGE-M3 / E5 / Jina resident alongside a generator LLM

Target workloads

Single-tenant streaming coding assistant running Llama 3.3 70B bf16 or Qwen3-Coder-30B-A3B — no TP coordination overhead
Developer workstation for a single engineer or tight team needing a 70B-class model with 32-128k context
Video or image generation lab — HunyuanVideo 13B, Wan 2.2 dual-expert, HunyuanImage-2.1 all at bf16 resident
VLM / OCR bench — Qwen3-VL-32B bf16 or InternVL3.5-38B with long-document pipelines
Clean appliance for a small LLM API gateway — one model, one card, easy ops

Measured performance

Published references | NVIDIA RTX Pro 6000 Blackwell datasheet + community benchmarks

Benchmark	Result
Per-card INT8 TOPS (NVIDIA datasheet)	2 000 TOPS
VRAM per card	96 GB ECC GDDR7
Memory bandwidth	~1 800 GB/s
Llama 3.3 70B Q6 single-GPU (community)	40-55 tok/s single-stream
Llama 3.3 70B bf16 single-GPU (community)	15-25 tok/s single-stream
Blackwell fp8 native	DeepSeek-V3 fp8, Hunyuan-A13B fp8 run without bf16 upcast

Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.

Not ideal for

Training large models from scratch (single GPU — no tensor/pipeline parallelism)
Frontier 200B+ MoE at real quantizations (Qwen3-235B Q4, GLM-4.5/4.6 — use Kentino AI 192 RTXPro6000 or larger)
High-concurrency multi-tenant inference (single card caps aggregate throughput; 4x RTX 4090 or 4x L40 scale better)

Warranty and lead time

2 years

parts warranty

1 year

labor warranty

10-28 days

lead time

NVIDIA OEM 3-year warranty on RTX Pro 6000 + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.