Прескочи към информацията за продукта
1 от 7

Kentino s.r.o.

K-AI 96 Rome 4090 2644TOPS — 4× RTX 4090 AI Inference Server

K-AI 96 Rome 4090 2644TOPS — 4× RTX 4090 AI Inference Server

Обичайна цена €18.491,00 EUR
Обичайна цена Цена при разпродажба €18.491,00 EUR
Разпродажба Изчерпано
С включени данъци. Доставката се изчислява при плащане.

K-AI 96 Rome 4090 2644TOPS

96 GB VRAM Inference Server
4x RTX 4090 | EPYC Rome | 2 644 TOPS INT8

647
TFLOPS fp16
179
tok/s batch-32
96 GB
VRAM pool
24/7
rack-ready

Measured on Kentino hardware. Llama 3.3 70B AWQ INT4 via vLLM 0.19.0.

A 4U rack-mount inference server with four GeForce RTX 4090 pooled to 96 GB VRAM, one AMD EPYC 7542 Rome CPU (32C/64T), 256 GB DDR4 ECC, 2 TB NVMe boot, and dual synchronized 2 kW ATX PSU. Runs vLLM, SGLang, llama.cpp, ComfyUI and every major open-weight inference stack out of the box.

Hardware

Component Detail
GPUs 4x NVIDIA GeForce RTX 4090 24 GB GDDR6X (450 W, PCIe 4.0 x16)
VRAM pool 96 GB total across 4 cards
CPU AMD EPYC 7542 Rome (32C/64T, 225 W, 128x PCIe 4.0 lanes)
Motherboard ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM 256 GB DDR4-2666 ECC RDIMM (4x 64 GB)
Storage 2 TB NVMe M.2 (PCIe 4.0 x4)
PSU Dual 2 kW ATX with sync cable
Chassis 4U rack-mount, front-to-back directed airflow
Cooling SP3 tower cooler, 3x front + 1x rear 120 mm industrial fans
Network Onboard dual 10 GbE (Intel X550)

Power envelope

  • GPU draw: 4 x 450 W = 1 800 W
  • System total: ~2 125 W
  • PSU total: 4 000 W (dual 2 kW) — 46.9% headroom
  • Split power delivery — single PSU failure = loss of 2 GPUs or 2 GPUs + motherboard

Lane topology

128 PCIe Gen4 lanes from EPYC to seven x16 slots; four populated by GPUs at Gen4 x16. No PCIe switch. No NVLink — peer-to-peer at 19-22 GB/s (Kentino measured).

What you can run

With 96 GB of pooled VRAM across 4 cards, this server handles open-weight LLMs, vision models, image and video generation, speech AI, and multi-tenant serving.

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3 / Qwen3.5: Qwen3-72B Q4 (~15-20 tok/s); Qwen3-32B Q6; Qwen3-30B-A3B MoE Q4-Q6; Qwen3-Coder-30B-A3B at 256k; Qwen3.5-122B-A10B Q4; QwQ-32B
  • DeepSeek: DeepSeek-R2 32B Q4-Q6 (92.7% AIME 2025); DeepSeek-R1-Distill-Qwen-32B bf16; DeepSeek-V2-Lite 16B
  • GLM / Z.ai: GLM-4.5-Air 106B/12B Q4-Q5; GLM-4.6V-Flash; GLM-Zero 9B
  • Hunyuan: Hunyuan-A13B Q4-Q6 (~48 GB) 256k ctx dual-mode reasoning
  • Others: Seed-OSS-36B Q4 512k ctx; ERNIE-4.5-47B-A3B Q4; Yi-34B Q6; Baichuan-M2-32B; Step-3.5-Flash

Western frontier

  • Meta Llama: Llama 3.3 70B Q4_K_M (~20 tok/s llama.cpp, ~179 tok/s batch-32 vLLM — Kentino measured); Llama 3.1 8B bf16 (~80-120 tok/s); Llama 4 Scout Q4
  • Mistral: Small 3 24B bf16; Magistral Small 24B reasoning; Devstral Small 2 24B 256k ctx; Mixtral 8x7B Q6
  • OpenAI: gpt-oss-20b MXFP4 (16 GB); gpt-oss-120b MXFP4 (80 GB tight)
  • Others: Gemma 3 27B Q6 128k; Phi-4 14B bf16; Nemotron-Super 49B Q4; Granite 4.0 H-Small; OLMo 2 32B; Reka Flash 3; Command R 35B

Vision-Language Models

Qwen3-VL-8B/32B, Qwen3-VL-30B-A3B, Qwen3-Omni-30B-A3B; InternVL3 up to 78B Q4; InternVL3.5-38B; DeepSeek-VL2; Llama 3.2 11B Vision; Pixtral 12B; Molmo 7B; Gemma 3 12B/27B; PaliGemma 2; MiniCPM-V 2.6 / MiniCPM-o 2.6.

Image generation

FLUX.1 [dev]/[schnell] fp8 (~15-25 s per 1024x1024); FLUX.1 Kontext; FLUX Tools; SD 3.5 Large; SDXL; HunyuanImage-2.1 bf16 (~34 GB) 2K native; Kolors 2.0; AuraFlow; OmniGen v1.

Video generation

Wan 2.2 T2V-A14B/I2V-A14B MoE (~54 GB bf16); Wan 2.2 TI2V-5B 720p@24fps; HunyuanVideo 13B Q4-Q5; HunyuanVideo 1.5; CogVideoX-5B; Open-Sora 2.0; Mochi-1; LTX-Video; SVD/SV3D/SV4D; NVIDIA Cosmos Predict 2.

Audio / Speech / TTS

  • ASR: Whisper v3 turbo (~50x realtime); Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
  • TTS: CosyVoice 3.0; Kokoro 82M; Stable Audio Open; Step-Audio-EditX
  • Realtime: Kyutai Moshi (200 ms full-duplex); Step-Audio 2 mini; Qwen2.5-Omni-7B
  • Music: MusicGen; AudioGen; Suno Bark; SeamlessM4T v2

Multi-model serving

  • 4-8 concurrent users on 32-72B LLMs via vLLM / SGLang tensor-parallel
  • Mixed: Qwen3-32B + FLUX.1 + Whisper-turbo + Moshi with partitioned VRAM
  • LoRA/QLoRA fine-tuning 32-72B; full-param 7-14B
  • RAG with Command R+ or Qwen3 + BGE-M3/E5/Jina

Target workloads

  • Inference gateway for 50-200 seat org (70B Q4-Q6, 4-8 concurrent sessions)
  • Batch diffusion/video pipeline (SDXL + FLUX.1 + Wan 2.2 overnight)
  • LoRA/QLoRA fine-tuning lab for 7-34B domain adaptations
  • RAG document assistant (Qwen3-VL + BGE-M3 + Command R, 32k ctx)
  • Mixed single-box: chat + image + ASR + realtime voice on partitioned VRAM

Measured performance

Kentino bench | 2026-04-10 | 4x RTX 4090 + EPYC 7542 + ROMED8-2T

Benchmark Result
Sustained compute (fp16) 647.7 TFLOPS
vLLM Llama 3.3 70B AWQ INT4 (single) 8.0 tok/s
vLLM Llama 3.3 70B AWQ INT4 (batch-32) 179.3 tok/s aggregate
llama.cpp Llama 3.3 70B Q4_K_M (single) 20.3 tok/s
Prompt eval 1 568 tok/s
GPU memory bandwidth 920 GB/s per card
NVMe read/write 4 589 / 4 213 MB/s
Peak thermal (GPU+CPU burn) 73 C, 0.6% drop

vLLM used awq kernel — 2-3x possible with awq_marlin.

Not ideal for

  • Frontier 100B+ dense at bf16 (DeepSeek V3/R1, GLM-4.5+, Kimi-K2, Mistral Large 3 — require 256+ GB VRAM)
  • Training from scratch (consumer RTX 4090 lack NVLink)

Warranty and lead time

2 years
parts warranty
1 year
labor warranty
10-28 days
lead time

Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.

Recommended add-ons

  • Upgrade RAM to 512 GB (add 4x 64 GB DDR4 — four DIMM slots open)
  • 4 TB NVMe secondary drive for dataset/model staging
  • 24U open cabinet for multi-server deployments
Покажи пълните подробности