Kentino s.r.o.
K-AI 288 Rome L40 — 6× NVIDIA L40 Passive Enterprise (288 GB ECC VRAM)
K-AI 288 Rome L40 2172TOPS
288 GB ECC VRAM Enterprise Server
6x NVIDIA L40 Passive | EPYC Milan | 2 172 TOPS INT8
Published external references. Not measured on Kentino hardware.
A 4U rack-mount enterprise inference server with six NVIDIA L40 (Ada Lovelace) passive datacenter cards, 48 GB ECC each for 288 GB of aggregate ECC VRAM, one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4-2666 ECC, a 2 TB NVMe boot drive, and dual synchronized 2.5 kW ATX PSUs. ECC end to end, purpose-built for 24/7 enterprise production where bit-level integrity and serviceable failure domains matter.
Hardware
| Component | Detail |
|---|---|
| GPUs | 6x NVIDIA L40 48 GB ECC (Ada Lovelace, passive datacenter, 300 W, PCIe 4.0 x16, dual-slot, 362 INT8 TOPS/card) |
| VRAM pool | 288 GB aggregate ECC across 6 cards (no NVLink on L40 PCIe SKU) |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 384 GB DDR4-2666 ECC RDIMM (6x 64 GB — 2 DIMM slots open for upgrade to 512 GB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 2x 2.5 kW ATX with dual-PSU sync cable (5 kW aggregate) |
| Chassis | 4U rack-mount (6-slot layout) |
| Cooling | SP3 tower cooler (Arctic Freezer 4U-M class) + front-to-back directed airflow (industrial fans) |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 6 x 300 W = 1 800 W
- System total under full load: ~2 175 W
- PSU total: 5 000 W (dual 2.5 kW synced) — 56.5% headroom
- Dual PSUs split power delivery: a single PSU failure means losing either 2 GPUs or 2 GPUs plus the motherboard
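The headroom figure can be reproduced from the numbers in the list above. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the non-GPU share is simply the quoted system total minus GPU draw, not a measured value.

```python
# Figures taken from the power envelope list above.
GPU_COUNT = 6
GPU_TDP_W = 300
SYSTEM_FULL_LOAD_W = 2175   # quoted system total under full load
PSU_COUNT = 2
PSU_RATED_W = 2500

gpu_draw_w = GPU_COUNT * GPU_TDP_W
psu_total_w = PSU_COUNT * PSU_RATED_W

# Headroom is the unused share of the combined PSU rating at full load.
headroom_pct = 100 * (psu_total_w - SYSTEM_FULL_LOAD_W) / psu_total_w

# Whatever is not GPU draw: CPU, RAM, storage, fans, conversion losses.
non_gpu_w = SYSTEM_FULL_LOAD_W - gpu_draw_w

print(f"GPU draw: {gpu_draw_w} W")
print(f"Non-GPU share: {non_gpu_w} W")
print(f"PSU headroom: {headroom_pct:.1f}%")
```

Note that the headroom is computed against the combined 5 kW rating; it says nothing about how the load is distributed between the two supplies.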
Lane topology
ROMED8-2T exposes 7x PCIe 4.0 x16 slots wired directly to the EPYC Milan CPU. Six slots are populated with passive Gen4 x16 risers, leaving one slot free for a NIC or storage. No PCIe switch is required. The L40's native link is PCIe 4.0 x16, so each card runs at full link bandwidth. There is no NVLink; inter-GPU traffic runs over PCIe peer-to-peer.
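As a rough sanity check on the lane budget and on what "PCIe peer-to-peer" means in bandwidth terms, the sketch below uses the standard PCIe 4.0 signalling rate (16 GT/s per lane, 128b/130b encoding) together with the slot and lane counts quoted on this page. It gives the theoretical per-direction link rate only; protocol overhead and real peer-to-peer throughput are lower and are not modelled here.

```python
GT_PER_S = 16.0          # PCIe 4.0 raw rate per lane, in GT/s
ENCODING = 128 / 130     # 128b/130b line encoding efficiency
LANES_PER_SLOT = 16
CPU_LANES = 128          # EPYC 7643, from the hardware table
SLOTS_TOTAL = 7          # ROMED8-2T x16 slots
SLOTS_GPU = 6            # slots populated with L40 cards
L40_VRAM_BW_GB_S = 864   # per-card memory bandwidth quoted on this page


def link_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe 4.0 link in GB/s."""
    return GT_PER_S * ENCODING * lanes / 8


x16_gb_s = link_gb_per_s(LANES_PER_SLOT)
gpu_lanes = SLOTS_GPU * LANES_PER_SLOT
slot_lanes = SLOTS_TOTAL * LANES_PER_SLOT

print(f"x16 link: {x16_gb_s:.1f} GB/s per direction")
print(f"GPU lanes: {gpu_lanes} of {CPU_LANES} CPU lanes")
print(f"All slots: {slot_lanes} of {CPU_LANES} CPU lanes")
print(f"Local VRAM vs x16 link: {L40_VRAM_BW_GB_S / x16_gb_s:.0f}x")
```

The last line is the practical point: traffic between cards moves at a small fraction of local VRAM speed, which is why tensor-parallel layouts on this platform are more sensitive to communication volume than on NVLink systems.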
What you can run
With 288 GB of pooled ECC VRAM across 6 passive L40 cards, this server handles frontier open-weight LLMs at Q4, multi-model concurrent serving, video/media pipelines, and 24/7 enterprise production inference. Note: L40 is Ada Lovelace, not Blackwell — fp8 upcasts to bf16. Use GGUF Q4/Q5 or AWQ/GPTQ int4 for maximum VRAM efficiency.
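The size figures quoted in the lists that follow can be approximated from parameter count and bits per weight. The sketch below is an estimate only: the 4.5 bits per weight used for "Q4" and the 10% per-card reserve for KV cache and activations are assumptions chosen for illustration, and real quantized files vary by format.

```python
import math

POOL_GB = 288   # 6 x 48 GB, from the hardware table
CARD_GB = 48


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def cards_needed(size_gb: float, reserve_frac: float = 0.10) -> int:
    """Cards required if each keeps reserve_frac free (assumed reserve)."""
    usable_per_card = CARD_GB * (1 - reserve_frac)
    return math.ceil(size_gb / usable_per_card)


# (name, parameters in billions, bits per weight)
examples = [
    ("Llama 3.3 70B bf16", 70, 16),
    ("Llama 4 Scout 109B bf16", 109, 16),
    ("Qwen3-235B-A22B Q4 (assumed 4.5 bpw)", 235, 4.5),
]

for name, params_b, bits in examples:
    size = weight_gb(params_b, bits)
    n = cards_needed(size)
    status = "fits" if n <= POOL_GB // CARD_GB else "does not fit"
    print(f"{name}: ~{size:.0f} GB, {n} cards, {status}")
```

Under these assumptions Llama 4 Scout in bf16 needs all six cards, which matches the "tight" label it carries below.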
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3-235B-A22B Q4 (~132 GB) with very long context + generous KV budget (~15-20 tok/s single, published reference)
- GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB) comfortable on 6-way TP (~12-18 tok/s single, published reference)
- Hunyuan-Large 389B/52B Q3 (~160 GB); ERNIE-4.5-424B-A47B Q3 (~180 GB)
- Qwen3-Coder-480B-A35B Q2 (~160 GB) flagship coding agent
- MiniMax-M1 / Text-01 Q3 (~180 GB) 1M-ctx Lightning Attention
- Qwen3-30B-A3B / QwQ-32B / Qwen3-32B — single-card with 6 parallel streams
- DeepSeek-R2 32B sparse MoE — single card per stream, 6 concurrent sessions
Western frontier
- Llama 3.3 70B bf16 (~142 GB) multi-tenant serving (~17 tok/s single, published reference), or Q4 (~43 GB) with 6 concurrent copies
- Llama 4 Scout 109B/17B bf16 (~218 GB tight) or Q4 (~63 GB) comfortable
- Mistral Small 3 / Magistral / Devstral Small (24B) bf16 (~40-50 tok/s single, published reference)
- Pixtral Large / Mistral Large 2 Q6-Q8 (~90-140 GB)
- Llama-3.1-Nemotron Ultra 253B Q4 (~119 GB)
- gpt-oss-120b MXFP4 (~80 GB via GGUF on Ada — note Ada upcast caveat)
- Cohere Command R+ 104B Q4 RAG stack
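For the single-card entries, what remains after the weights are loaded sets the context budget. The sketch below estimates the KV cache for the Llama 3.3 70B Q4 (~43 GB) case on one 48 GB card. The layer count, KV head count, and head dimension are the published Llama 3 70B architecture values; the fp16 KV cache and the absence of any runtime overhead are simplifying assumptions, so the result is an upper bound rather than a usable figure.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Llama 3 70B architecture: 80 layers, 8 KV heads (GQA), head dim 128.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

CARD_GB = 48
WEIGHTS_GB = 43   # Q4 size quoted in the list above

# Assumes everything left on the card goes to KV cache (optimistic).
free_bytes = (CARD_GB - WEIGHTS_GB) * 10**9
max_tokens = free_bytes // per_token

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Upper bound on cached tokens per card: {max_tokens}")
```

That budget is shared across all concurrent sequences on the card, so quantizing the KV cache or spreading the model over two cards are the usual ways to buy more context.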
Vision-Language Models
Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B; InternVL3.5-78B / 241B-A28B Q4 (~135 GB); Llama 3.2 90B Vision bf16 (~180 GB); Pixtral 12B; Molmo 72B; Gemma 3 12B/27B multimodal; GLM-4.6V full (106B bf16); MiniCPM-o 2.6. L40's NVENC/NVDEC is particularly useful for high-throughput VLM document / video pipelines.
Image generation
FLUX.1 [dev] / Kontext / Tools across multiple workers concurrently (~3.5 s per 1024x1024 image on single L40 fp8, published reference) — 6x ComfyUI worker farm possible; SD 3.5 Large; HunyuanImage-2.1 (17B) bf16; HunyuanDiT; Kolors 2.0; AuraFlow; OmniGen.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B dual-expert bf16 (~54 GB, ~20-30 s per 4s clip at 720p, published reference); HunyuanVideo 13B bf16 both experts; Open-Sora 2.0 bf16; CogVideoX-5B; Mochi-1; LTX-Video; Pyramid Flow; NVIDIA Cosmos Predict 2. L40's hardware NVENC/NVDEC handles caption / moderation / transcode at scale alongside generation.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo; Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2/3; Kokoro 82M; Stable Audio Open; XTTS v2; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi; Step-Audio 2 mini / R1; Qwen2.5-Omni-7B
Multi-model / multi-tenant serving
- Multi-model residency — Qwen3-235B Q4 + FLUX.1 + HunyuanVideo + Whisper-turbo + Moshi + embedder, all resident
- 6 concurrent 48 GB-class workloads (one per card): 6x Qwen3-VL-32B, or 6x FLUX.1 workers, or 6x ASR streams
- 6-way tensor-parallel for 200B+ MoE at Q4 with real context
- RAG pipelines — Command R+ / Qwen3 + reranker + embedder + image analysis on same host
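The one-workload-per-card layout is usually achieved by limiting each worker process to a single device through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below builds one environment per worker; the function name `worker_envs` and the worker command in the usage comment are hypothetical, and how a given serving stack is launched is left to that stack's own documentation.

```python
import os

GPU_COUNT = 6  # one L40 per worker


def worker_envs(gpu_count: int = GPU_COUNT, base_env=None):
    """Return one environment dict per worker, each exposing one GPU."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for gpu in range(gpu_count):
        env = dict(base)
        # Enumerate devices by PCI bus order so indices match nvidia-smi.
        env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        # Inside the worker, the selected card appears as device 0.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
        envs.append(env)
    return envs


# Usage (hypothetical worker command, shown for illustration only):
#   import subprocess
#   for env in worker_envs():
#       subprocess.Popen(["python", "worker.py"], env=env)
```

This gives process-level separation per physical card; it does not by itself enforce tenant isolation at the OS or network level.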
Target workloads
- 24/7 production LLM inference backend — 100+ concurrent users on 200B+ MoE at Q4, ECC-protected
- Media-AI pipeline at enterprise scale — caption + moderation + thumbnail + transcode on 6 parallel streams via NVENC/NVDEC
- Multi-tenant SaaS where per-tenant isolation across physical cards matters
- RAG backend with Command R+ reader + reranker + embedder + vision fully resident
- Reliability-first pair replacing the 12x L40 Legacy — two K-AI 288 servers = 576 GB aggregate with independent failure domains
Published performance references
External references | Not measured on Kentino hardware
| Benchmark | Result |
|---|---|
| L40 per-card INT8 TOPS | 362 TOPS |
| L40 memory bandwidth | 864 GB/s per card |
| vLLM — Llama 3.3 70B AWQ INT4 on 2x L40 TP (single) | ~25-35 tok/s |
| vLLM — Llama 3.3 70B AWQ INT4 on 2x L40 TP (batch-16) | ~150-200 tok/s aggregate |
| llama.cpp — GLM-4.6 Q4 on 6x L40 (single) | ~12-18 tok/s |
| FLUX.1 [dev] on single L40 fp8 | ~3.5 s per 1024x1024 image |
Kentino will publish first-party numbers after the initial customer build.
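One way to sanity-check the single-stream figures in the table is a memory-bandwidth ceiling: during decode, each generated token requires reading the active weights once, so tokens per second cannot exceed aggregate memory bandwidth divided by weight bytes. The sketch below applies this to the Llama 3.3 70B INT4 on 2x L40 row. The 35 GB weight size (70B parameters at 4 bits, ignoring quantization scales) is an assumption, and the model ignores KV cache reads, PCIe communication, and kernel efficiency, so it is a loose upper bound, not a prediction.

```python
L40_BW_GB_S = 864   # per-card memory bandwidth, from the table above


def decode_ceiling_tok_s(weights_gb: float, cards: int) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed."""
    return cards * L40_BW_GB_S / weights_gb


# Assumed: 70B parameters at 4 bits per weight, about 35 GB read per token.
ceiling = decode_ceiling_tok_s(weights_gb=35, cards=2)
print(f"Ceiling for 70B INT4 on 2x L40: {ceiling:.1f} tok/s")
```

The published ~25-35 tok/s single-stream range sits below this ceiling, which is the relationship one would expect once communication and kernel overheads are included.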
Not ideal for
- fp8-native inference at full speed — Ada upcasts to bf16; use GGUF Q4/Q5 or AWQ/GPTQ int4 instead. For fp8 native see K-AI 384 Rome RTXPro6000 (Blackwell)
- Training large models from scratch (no NVLink)
- Budget single-user inference — 4x L4 or 4x 5080 is materially cheaper for small workloads
- Frontier 600B+ dense at Q4+ (requires a 576 GB+ pool; see 6x RTX Pro 6000)
Warranty and lead time
Build includes assembly, BIOS configuration, driver install, burn-in, memtest, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- Upgrade RAM to 512 GB DDR4 (add 2x 64 GB — 2 DIMM slots open) for heavier KV budget
- 4 TB NVMe Gen4 x4 for model library staging
- Full 24U rack cabinet with managed PDU + online UPS (critical for 24/7 ECC workloads)
- Paired second K-AI 288 unit — replaces the 12x L40 Legacy envelope with two independent failure domains
