
Kentino s.r.o.

K-AI 288 Rome L40 — 6× NVIDIA L40 Passive Enterprise (288 GB ECC VRAM)

Regular price €59.490,00 EUR
Taxes included. Shipping calculated at checkout.

K-AI 288 Rome L40 2172TOPS

288 GB ECC VRAM Enterprise Server
6x NVIDIA L40 Passive | EPYC Milan | 2 172 TOPS INT8

2 172 TOPS INT8
288 GB ECC VRAM pool
ECC end-to-end
24/7 production-rated

Published external references. Not measured on Kentino hardware.

A 4U rack-mount enterprise inference server with six NVIDIA L40 Ada Lovelace passive datacenter cards (48 GB ECC each, 288 GB ECC VRAM in aggregate), one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4-2666 ECC, a 2 TB NVMe boot drive, and two synchronized 2.5 kW ATX PSUs. ECC end-to-end, purpose-built for 24/7 enterprise production where bit-level integrity and serviceable failure domains matter.

Hardware

Component | Detail
GPUs | 6x NVIDIA L40 48 GB ECC (Ada Lovelace, passive datacenter, 300 W, PCIe 4.0 x16, dual-slot, 362 INT8 TOPS/card)
VRAM pool | 288 GB aggregate ECC across 6 cards (no NVLink on L40 PCIe SKU)
CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM | 384 GB DDR4-2666 ECC RDIMM (6x 64 GB; 2 DIMM slots open for upgrade to 512 GB)
Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | 2x 2.5 kW ATX with dual-PSU sync cable (5 kW aggregate)
Chassis | 4U rack-mount (6-slot layout)
Cooling | SP3 tower cooler (Arctic Freezer 4U-M class) + front-to-back directed airflow (industrial fans)
Network | Onboard dual 10 GbE (Intel X550)
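
The headline figures follow from the per-card values in the table. A minimal sketch of that arithmetic, assuming six identical cards and simple summation (the names below are illustrative, not part of any vendor tool):

```python
# Per-card figures as listed in the hardware table.
CARDS = 6
VRAM_GB_PER_CARD = 48
INT8_TOPS_PER_CARD = 362
BOARD_POWER_W_PER_CARD = 300


def aggregate(per_card: float, cards: int = CARDS) -> float:
    """Sum a per-card figure across identical cards."""
    return per_card * cards


pool_gb = aggregate(VRAM_GB_PER_CARD)
total_tops = aggregate(INT8_TOPS_PER_CARD)
gpu_power_w = aggregate(BOARD_POWER_W_PER_CARD)

print(f"VRAM pool: {pool_gb} GB, INT8: {total_tops} TOPS, GPU power: {gpu_power_w} W")
```

The sums are nominal: without NVLink the 288 GB is six separate 48 GB memories, so a large model has to be sharded across cards rather than loaded into one address space.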

Power envelope

  • GPU draw: 6 x 300 W = 1 800 W
  • System total under full load: ~2 175 W
  • PSU total: 5 000 W (dual 2.5 kW synced) — 56.5% headroom
  • Dual PSUs split power delivery: a single PSU failure takes down 2 GPUs, or 2 GPUs plus the motherboard, depending on which unit fails
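
The headroom figure can be reproduced from the listed numbers. A minimal sketch, assuming headroom is defined as unused capacity divided by total PSU capacity (that definition is an assumption, chosen because it matches the 56.5% quoted above):

```python
# Figures from the power envelope list.
GPU_COUNT = 6
GPU_BOARD_POWER_W = 300
SYSTEM_FULL_LOAD_W = 2175
PSU_COUNT = 2
PSU_RATED_W = 2500


def headroom_fraction(load_w: float, capacity_w: float) -> float:
    """Share of capacity left unused at the given load (assumed definition)."""
    return (capacity_w - load_w) / capacity_w


gpu_draw_w = GPU_COUNT * GPU_BOARD_POWER_W
psu_total_w = PSU_COUNT * PSU_RATED_W
non_gpu_w = SYSTEM_FULL_LOAD_W - gpu_draw_w  # CPU, RAM, fans, storage, losses
headroom = headroom_fraction(SYSTEM_FULL_LOAD_W, psu_total_w)

print(f"GPU draw: {gpu_draw_w} W, rest of system: {non_gpu_w} W")
print(f"Headroom on {psu_total_w} W: {headroom:.1%}")
```

The remaining ~375 W is what the listing implicitly budgets for everything other than the GPUs.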

Lane topology

ROMED8-2T exposes 7x PCIe 4.0 x16 direct from EPYC Milan. Six slots populated with passive Gen4 x16 risers — one free slot for NIC / storage. No PCIe switch required. L40 native link is PCIe 4.0 x16 — no bandwidth loss. No NVLink; inter-GPU traffic runs PCIe peer-to-peer.
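
A rough check of the lane budget and per-slot bandwidth, as a sketch. The 16 GT/s per-lane rate and 128b/130b encoding are the standard PCIe 4.0 figures; the result is a theoretical per-direction ceiling, not a measured peer-to-peer rate:

```python
# Lane counts from the listing.
CPU_LANES = 128
SLOTS = 7
LANES_PER_SLOT = 16
GPU_SLOTS = 6

# PCIe 4.0 signalling: 16 GT/s per lane with 128b/130b line encoding.
GT_PER_S_PER_LANE = 16.0
ENCODING_EFFICIENCY = 128 / 130


def link_gbytes_per_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe 4.0 link in GB/s."""
    return lanes * GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8


slot_lanes = SLOTS * LANES_PER_SLOT
gpu_lanes = GPU_SLOTS * LANES_PER_SLOT
x16_gbytes_per_s = link_gbytes_per_s(LANES_PER_SLOT)

print(f"Slot lanes: {slot_lanes} of {CPU_LANES} CPU lanes ({gpu_lanes} used by GPUs)")
print(f"x16 ceiling: {x16_gbytes_per_s:.1f} GB/s per direction")
```

All seven x16 slots fit inside the CPU's 128 lanes, which is why no PCIe switch is needed. Inter-GPU traffic is capped by the x16 ceiling, well below each card's 864 GB/s local memory bandwidth.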

What you can run

With 288 GB of pooled ECC VRAM across 6 passive L40 cards, this server handles frontier open-weight LLMs at Q4, multi-model concurrent serving, video/media pipelines, and 24/7 enterprise production inference. Note: L40 is Ada Lovelace, not Blackwell — fp8 upcasts to bf16. Use GGUF Q4/Q5 or AWQ/GPTQ int4 for maximum VRAM efficiency.
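
The size estimates in the lists below can be approximated from parameter count and bits per weight. A minimal sketch, assuming 4.5 bits per weight for a Q4-class format (a 4-bit block format with one 16-bit scale per 32 weights; real GGUF variants differ) and 1 GB = 1e9 bytes:

```python
POOL_GB = 288  # 6 x 48 GB, from the listing
Q4_BITS = 4.5  # assumption: 4-bit weights plus per-block scale overhead


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a given parameter count and precision."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         reserve_gb: float = 0.0, pool_gb: float = POOL_GB) -> bool:
    """True if weights plus a reserve (KV cache, buffers) fit in the pool."""
    return weight_footprint_gb(params_billion, bits_per_weight) + reserve_gb <= pool_gb


for name, params_b, bits in [("235B at Q4", 235, Q4_BITS), ("70B at bf16", 70, 16)]:
    gb = weight_footprint_gb(params_b, bits)
    print(f"{name}: ~{gb:.0f} GB weights, ~{POOL_GB - gb:.0f} GB left in pool")
```

This counts weights only. Each shard must also fit within a single card's 48 GB, and runtime buffers and KV cache come on top, so treat the result as a lower bound on what is needed.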

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3-235B-A22B Q4 (~132 GB) with very long context + generous KV budget (~15-20 tok/s single, published reference)
  • GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB) comfortable on 6-way TP (~12-18 tok/s single, published reference)
  • Hunyuan-Large 389B/52B Q3 (~160 GB); ERNIE-4.5-424B-A47B Q3 (~180 GB)
  • Qwen3-Coder-480B-A35B Q2 (~160 GB) flagship coding agent
  • MiniMax-M1 / Text-01 Q3 (~180 GB) 1M-ctx Lightning Attention
  • Qwen3-30B-A3B / QwQ-32B / Qwen3-32B — single-card with 6 parallel streams
  • DeepSeek-R2 32B sparse MoE — single card per stream, 6 concurrent sessions

Western frontier

  • Llama 3.3 70B bf16 (~142 GB) multi-tenant serving (~17 tok/s single, published reference), or Q4 (~43 GB) with 6 concurrent copies
  • Llama 4 Scout 109B/17B bf16 (~218 GB tight) or Q4 (~63 GB) comfortable
  • Mistral Small 3 / Magistral / Devstral Small (24B) bf16 (~40-50 tok/s single, published reference)
  • Pixtral Large / Mistral Large 2 Q6-Q8 (~90-140 GB)
  • Llama-3.1-Nemotron Ultra 253B Q4 (~119 GB)
  • gpt-oss-120b MXFP4 (~80 GB via GGUF on Ada — note Ada upcast caveat)
  • Cohere Command R+ 104B Q4 RAG stack
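
What is left after the weights goes to the KV cache, which sets how much context can be held across all sessions. A sketch of that budget for the Llama 3.3 70B bf16 case, assuming the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and a 16-bit cache; these shape values are assumptions taken from the model's public configuration, not from this listing:

```python
POOL_GB = 288     # from the listing
WEIGHTS_GB = 142  # Llama 3.3 70B bf16, as listed

# Assumed model shape (published Llama 3 70B configuration).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16/bf16 cache entries


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(free_gb: float, bytes_per_token: int) -> int:
    """Tokens that fit in the free memory, ignoring runtime overhead."""
    return int(free_gb * 1e9 // bytes_per_token)


per_token = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
tokens = max_cached_tokens(POOL_GB - WEIGHTS_GB, per_token)

print(f"KV cache: {per_token / 1024:.0f} KiB per token")
print(f"Upper bound on cached tokens across all sessions: {tokens:,}")
```

The bound is optimistic: activations, framework buffers, and fragmentation all reduce it, and the budget is shared by every concurrent request.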

Vision-Language Models

Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B; InternVL3.5-78B / 241B-A28B Q4 (~135 GB); Llama 3.2 90B Vision bf16 (~180 GB); Pixtral 12B; Molmo 72B; Gemma 3 12B/27B multimodal; GLM-4.6V full (106B bf16); MiniCPM-o 2.6. L40's NVENC/NVDEC is particularly useful for high-throughput VLM document / video pipelines.

Image generation

FLUX.1 [dev] / Kontext / Tools across multiple workers concurrently (~3.5 s per 1024x1024 image on single L40 fp8, published reference) — 6x ComfyUI worker farm possible; SD 3.5 Large; HunyuanImage-2.1 (17B) bf16; HunyuanDiT; Kolors 2.0; AuraFlow; OmniGen.

Video generation

Wan 2.2 T2V-A14B / I2V-A14B dual-expert bf16 (~54 GB, ~20-30 s per 4s clip at 720p, published reference); HunyuanVideo 13B bf16 both experts; Open-Sora 2.0 bf16; CogVideoX-5B; Mochi-1; LTX-Video; Pyramid Flow; NVIDIA Cosmos Predict 2. L40's hardware NVENC/NVDEC handles caption / moderation / transcode at scale alongside generation.

Audio / Speech / TTS

  • ASR: Whisper v3 large / turbo; Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice
  • TTS: CosyVoice 2/3; Kokoro 82M; Stable Audio Open; XTTS v2; Step-Audio-EditX
  • Realtime / S2S: Kyutai Moshi; Step-Audio 2 mini / R1; Qwen2.5-Omni-7B

Multi-model / multi-tenant serving

  • Multi-model residency — Qwen3-235B Q4 + FLUX.1 + HunyuanVideo + Whisper-turbo + Moshi + embedder, all resident
  • 6 concurrent 48 GB-class workloads (one per card): 6x Qwen3-VL-32B, or 6x FLUX.1 workers, or 6x ASR streams
  • 6-way tensor-parallel for 200B+ MoE at Q4 with real context
  • RAG pipelines — Command R+ / Qwen3 + reranker + embedder + image analysis on same host
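
One common way to get the one-workload-per-card layout is to start a process per GPU with `CUDA_VISIBLE_DEVICES` set, so each process sees only its own card. A minimal sketch; `worker.py` is a hypothetical placeholder for whatever serving process is used, and nothing is launched unless `launch` is called:

```python
import os
import subprocess

GPU_COUNT = 6  # one worker per L40


def worker_env(gpu_index: int, base_env=None) -> dict:
    """Copy the environment and expose a single GPU to the worker."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def plan_workers(command, gpu_count: int = GPU_COUNT, base_env=None):
    """Build (command, env) pairs, one per GPU, without starting anything."""
    return [(list(command), worker_env(i, base_env)) for i in range(gpu_count)]


def launch(plan):
    """Start every planned worker and return the process handles."""
    return [subprocess.Popen(cmd, env=env) for cmd, env in plan]


# "worker.py" is hypothetical; substitute the real serving entry point.
plan = plan_workers(["python", "worker.py"], base_env={})
for cmd, env in plan:
    print(f"GPU {env['CUDA_VISIBLE_DEVICES']}: {' '.join(cmd)}")
```

Inside each worker the visible card appears as device 0, which keeps per-tenant code identical across all six processes.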

Target workloads

  • 24/7 production LLM inference backend — 100+ concurrent users on 200B+ MoE at Q4, ECC-protected
  • Media-AI pipeline at enterprise scale — caption + moderation + thumbnail + transcode on 6 parallel streams via NVENC/NVDEC
  • Multi-tenant SaaS where per-tenant isolation across physical cards matters
  • RAG backend with Command R+ reader + reranker + embedder + vision fully resident
  • Reliability-first pair replacing the 12x L40 Legacy — two K-AI 288 servers = 576 GB aggregate with independent failure domains

Published performance references

External references | Not measured on Kentino hardware

Benchmark | Result
L40 per-card INT8 TOPS | 362 TOPS
L40 memory bandwidth | 864 GB/s per card
vLLM, Llama 3.3 70B AWQ INT4 on 2x L40 TP (single) | ~25-35 tok/s
vLLM, Llama 3.3 70B AWQ INT4 on 2x L40 TP (batch-16) | ~150-200 tok/s aggregate
llama.cpp, GLM-4.6 Q4 on 6x L40 (single) | ~12-18 tok/s
FLUX.1 [dev] on single L40 fp8 | ~3.5 s per 1024x1024 image

Kentino will publish first-party numbers after the initial customer build.
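
As a sanity check on the single-stream figures, decode speed for a dense model is often bounded by how fast the weights can be read from VRAM once per token. A rough sketch of that ceiling for the 2x L40 Llama 3.3 70B INT4 row, assuming exactly 4 bits per weight and perfect bandwidth scaling across cards (both idealizations):

```python
L40_BANDWIDTH_GB_S = 864  # per card, from the table


def decode_ceiling_tok_s(weight_gb: float, cards: int,
                         bandwidth_gb_s: float = L40_BANDWIDTH_GB_S) -> float:
    """Idealized single-stream tokens/s if every token reads all weights once."""
    return cards * bandwidth_gb_s / weight_gb


weights_gb = 70 * 4 / 8  # 70B parameters at an assumed 4 bits each
ceiling = decode_ceiling_tok_s(weights_gb, cards=2)

print(f"Weights: {weights_gb:.0f} GB, bandwidth-bound ceiling: ~{ceiling:.0f} tok/s")
```

The quoted ~25-35 tok/s sits below this ceiling, as expected once KV cache reads, PCIe synchronization between the two cards, and kernel efficiency are accounted for. The estimate does not apply directly to MoE models, which read only their active experts per token.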

Not ideal for

  • fp8-native inference at full speed — Ada upcasts to bf16; use GGUF Q4/Q5 or AWQ/GPTQ int4 instead. For fp8 native see K-AI 384 Rome RTXPro6000 (Blackwell)
  • Training large models from scratch (no NVLink)
  • Budget single-user inference — 4x L4 or 4x 5080 is materially cheaper for small workloads
  • Frontier 600B+ dense models at Q4 or higher (these require a 576 GB+ pool; see 6x RTX Pro 6000)

Warranty and lead time

3 years NVIDIA OEM GPU warranty
2 years parts warranty
1 year labor warranty
10-28 days lead time

Build includes assembly, BIOS configuration, driver install, burn-in, memtest, and functional verification. Lead time depends on component availability, confirmed at order.

Recommended add-ons

  • Upgrade RAM to 512 GB DDR4 (add 2x 64 GB — 2 DIMM slots open) for heavier KV budget
  • 4 TB NVMe Gen4 x4 for model library staging
  • Full 24U rack cabinet with managed PDU + online UPS (critical for 24/7 ECC workloads)
  • Paired second K-AI 288 unit — replaces the 12x L40 Legacy envelope with two independent failure domains