Kentino s.r.o.

K-AI 192 Rome L40 1448TOPS — 4× NVIDIA L40 — EPYC Milan

Regular price €40,798.00 EUR
Sold out
Taxes included. Shipping is calculated at checkout.

K-AI 192 Rome L40 1448TOPS

192 GB ECC Enterprise Inference Server
4x NVIDIA L40 Passive | EPYC Milan | 1 448 TOPS INT8

1 448 INT8 TOPS | 192 GB ECC VRAM | ECC datacenter grade | 24/7 passive cooled

Four passive L40 datacenter cards with ECC memory. Same 192 GB pool as 8x RTX 4090 — but datacenter-grade, ECC-protected, and OEM-warrantied.

A 4U rack-mount inference server with four passive NVIDIA L40 cards for a combined 192 GB of ECC VRAM, one AMD EPYC 7643 Milan CPU (48C/96T), 256 GB DDR4 ECC, a 2 TB NVMe boot drive, and dual synchronized 2 kW ATX PSUs. The L40 is the datacenter sibling of the RTX 4090: passive-cooled, ECC-equipped, with on-die NVENC/NVDEC hardware video engines and a 3-year NVIDIA OEM warranty. Runs vLLM, SGLang, llama.cpp, Triton, and TensorRT-LLM out of the box.

Hardware

Component | Detail
GPUs | 4x NVIDIA L40 48 GB ECC GDDR6 (Ada Lovelace, passive, 300 W, dual-slot, PCIe 4.0 x16)
VRAM pool | 192 GB ECC across 4 cards (no NVLink on L40)
CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM | 256 GB DDR4-2666 ECC RDIMM (4x 64 GB)
Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | Dual 2 kW ATX PSU with sync cable
Chassis | 4U rack-mount with front-to-back directed airflow
Cooling | Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust
Network | Onboard dual 10 GbE (Intel X550)

Power envelope

  • GPU draw: 4 x 300 W = 1 200 W
  • System total at full load: ~1 525 W
  • PSU total: 4 000 W (dual 2 kW synced) — 61.9 % headroom
  • Dual PSU for split power delivery and N+1 capability
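
The headroom figure can be reproduced from the numbers in the list above. The following is a minimal sketch; the variable and function names are illustrative, and the single-PSU case is an added assumption (one supply carrying the full load) rather than a figure stated by the vendor.

```python
# Figures taken from the power envelope list above.
GPU_COUNT = 4
GPU_WATTS = 300
SYSTEM_FULL_LOAD_W = 1525  # approximate full-load total quoted above
PSU_COUNT = 2
PSU_WATTS = 2000


def headroom_pct(load_w: float, capacity_w: float) -> float:
    """Unused share of PSU capacity at the given load, in percent."""
    return (capacity_w - load_w) / capacity_w * 100


gpu_draw_w = GPU_COUNT * GPU_WATTS
non_gpu_w = SYSTEM_FULL_LOAD_W - gpu_draw_w  # CPU, RAM, fans, storage, NICs
combined_headroom = headroom_pct(SYSTEM_FULL_LOAD_W, PSU_COUNT * PSU_WATTS)
# Assumption: one PSU fails and the other carries the whole load.
single_psu_headroom = headroom_pct(SYSTEM_FULL_LOAD_W, PSU_WATTS)

print(f"GPU draw: {gpu_draw_w} W, rest of system: {non_gpu_w} W")
print(f"Headroom on both PSUs: {combined_headroom:.1f} %")
print(f"Headroom on a single PSU: {single_psu_headroom:.1f} %")
```

A positive single-PSU headroom is what makes the N+1 claim plausible on paper: the quoted full load stays below the rating of one 2 kW supply.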

Lane topology

PCIe Gen4 x16 per card (L40 is Gen4 native). Direct root-complex connection from single EPYC — no PCIe switch. No NVLink — inter-GPU traffic runs PCIe peer-to-peer. Three x16 slots remain for NIC / storage expansion.
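
As a rough check on the lane topology, the sketch below counts the lanes and slots used by the GPUs and estimates the theoretical one-direction bandwidth of a Gen4 x16 link. The 16 GT/s rate and 128b/130b encoding come from the PCIe 4.0 specification; the names are illustrative, and lanes used by onboard devices (NVMe, 10 GbE) are not modeled.

```python
# Values from the hardware table and lane topology paragraph above.
CPU_LANES = 128      # EPYC 7643, PCIe 4.0
SLOTS_X16 = 7        # ASRock Rack ROMED8-2T
GPU_COUNT = 4
LANES_PER_GPU = 16

# PCIe 4.0 link parameters (specification values).
GT_PER_S_PER_LANE = 16
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding

gpu_lanes = GPU_COUNT * LANES_PER_GPU
free_lanes = CPU_LANES - gpu_lanes  # before onboard devices are counted
free_slots = SLOTS_X16 - GPU_COUNT

# Theoretical per-direction bandwidth of one x16 link in GB/s,
# before protocol overhead.
x16_gb_per_s = LANES_PER_GPU * GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8

print(f"GPU lanes: {gpu_lanes} of {CPU_LANES}, free x16 slots: {free_slots}")
print(f"Gen4 x16 link: ~{x16_gb_per_s:.1f} GB/s per direction")
```

Because there is no NVLink, this per-link figure is also the upper bound for peer-to-peer traffic between any two cards; measured throughput will be lower.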

What you can run

With 192 GB of ECC VRAM across 4 datacenter cards, this server handles 200B+ frontier MoE models at Q4, enterprise multi-tenant serving with strict SLAs, and 24/7 production inference with ECC protection against memory bit flips.

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q4 (~132 GB) with long context — the hero config (~12-18 tok/s single-stream across 4x L40); Qwen3-Coder-480B-A35B Q2 (~160 GB, tight); Qwen3.5-122B-A10B fp8 (~75 GB) with huge KV; Qwen3-32B dense bf16 multiple concurrent streams
  • DeepSeek: DeepSeek-V3/R1/V3.1/V3.2 Q2 (~215 GB with minor RAM spill); DeepSeek-R2 32B — 4x concurrent streams, one per card
  • GLM / Z.ai: GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB) — sweet spot for this tier; GLM-4.5-Air 106B/12B fp8 or bf16
  • Tencent Hunyuan: Hunyuan-Large Q3 (~160 GB) — 389B MoE with 256k ctx; Hunyuan-A13B fp8 (~80 GB) with huge KV
  • Baidu ERNIE-4.5-424B Q3 (~180 GB); InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3.5-397B Q3 (~170 GB)

Western frontier

  • Meta Llama: Llama 3.3 70B bf16 with massive KV (~15-18 tok/s single-stream on 4x L40); Llama 4 Scout bf16 (~218 GB, exceeds the 192 GB pool, so expect some system-RAM spill); Llama 4 Maverick 400B/17B Q3 (~188 GB)
  • Mistral: Mistral Large 2 / Pixtral Large / Devstral 2 123B Q6 (~102 GB) comfortable; Mistral Small 3 multi-stream
  • OpenAI (open weights): gpt-oss-120b MXFP4 (80 GB) with generous KV
  • NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q4 (~147 GB); Super 49B bf16 multiple streams
  • Google Gemma 3: 27B multimodal bf16 — multiple resident streams
  • Others: Cohere Command R+ 104B Q6 (~85 GB); OLMo 3.1 32B; Reka Flash 3 21B; IBM Granite 4.0 H-Small
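
The footprints quoted in the lists above can be compared against the 192 GB pool to see which configurations leave room for KV cache and which must spill to system RAM. This is a minimal sketch using a handful of the quoted sizes; the function name is hypothetical, and the check ignores activation memory, framework overhead, and whether the weights shard evenly across the four 48 GB cards.

```python
POOL_GB = 4 * 48  # four L40 cards at 48 GB each

# Approximate weight footprints as quoted in the lists above.
quoted_footprints_gb = {
    "Qwen3-235B-A22B Q4": 132,
    "GLM-4.5 Q4": 177,
    "DeepSeek-V3 Q2": 215,
    "Llama 4 Scout bf16": 218,
    "gpt-oss-120b MXFP4": 80,
}


def vram_margin_gb(weights_gb: float, pool_gb: float = POOL_GB) -> float:
    """Positive: GB left for KV cache. Negative: GB that must spill to RAM."""
    return pool_gb - weights_gb


for name, weights_gb in quoted_footprints_gb.items():
    margin = vram_margin_gb(weights_gb)
    status = "fits" if margin >= 0 else "spills to system RAM"
    print(f"{name}: {weights_gb} GB, margin {margin:+} GB ({status})")
```

A small positive margin still limits context length and concurrency, since the KV cache has to fit in whatever the weights leave free.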

Vision-Language Models

InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B bf16; Llama 3.2 90B Vision bf16 (~180 GB); Pixtral Large 124B Q6-bf16; Molmo 72B bf16; GLM-4.6V 106B fp8; Gemma 3 27B multimodal multiple streams; InternVL3 78B bf16; DeepSeek-VL2 full range.

Image generation

FLUX.1 [dev] / [schnell] bf16 with concurrent generation (~3-4 s per 1024x1024 image on L40); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large bf16 x 2-3 concurrent; HunyuanImage-2.1 bf16 (~34 GB) multi-stream; HunyuanImage-3.0 base (80B MoE, 13B active) bf16 (~80 GB); HunyuanDiT; Kolors / Kolors 2.0; AuraFlow; OmniGen v1; PixArt-Sigma.

Video generation

Wan 2.2 T2V-A14B / I2V-A14B MoE bf16 dual-expert full-context; Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B bf16 both experts; HunyuanVideo 1.5; CogVideoX-5B bf16; Open-Sora 2.0 11B bf16; Mochi-1 bf16 (~42 GB) multi-stream; LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2.

Audio / Speech / TTS

  • ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
  • TTS: CosyVoice 2/3; Kokoro 82M; XTTS v2; Stable Audio Open; Step-Audio-EditX
  • Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
  • Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

  • Enterprise production LLM gateway — Qwen3-235B Q4 or GLM-4.5/4.6 Q4 serving 16-32 concurrent users with strict SLA
  • Mixed resident stack: 235B MoE + FLUX.1 + Whisper-turbo + Moshi with partitioned VRAM and ECC protection
  • Live video + AI pipeline — NVENC/NVDEC hardware encoders stream 6-8 parallel captioning + moderation pipelines
  • Multi-tenant RAG — query-side embedder + 70B reader + reranker at sub-second P99 latency

Target workloads

  • 24/7 production LLM inference at 192 GB pool (Qwen3-235B Q4, GLM-4.5/4.6/4.7 Q4, Llama 4 Scout bf16 with partial system-RAM offload)
  • Enterprise multi-tenant serving with strict SLA — ECC reliability over long runs
  • RAG + vector DB serving with high-quality retrieval models concurrent
  • Media / video AI pipelines — NVENC / NVDEC hardware path, VFX rendering, transcribe/translate
  • Datacenter silent-operation deployments — passive cards, low acoustic profile near office space

Measured performance

Published references | NVIDIA L40 datasheet + community benchmarks

Benchmark | Result
Per-card INT8 TOPS (NVIDIA datasheet) | 362 TOPS
Aggregate INT8 TOPS (4 cards) | 1 448 TOPS
Per-card VRAM | 48 GB ECC GDDR6, 864 GB/s bandwidth
Llama 3.3 70B Q6 via vLLM (community) | 30-50 tok/s single-stream, 150+ tok/s batch-16
FLUX.1 [dev] bf16 on L40 (community) | ~3-4 s per 1024x1024 image
NVENC / NVDEC | Gen-8 hardware encoders on-die (video AI pipeline)

Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.
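
The aggregate figures follow directly from the per-card datasheet values. The sketch below reproduces them and adds a rough memory-bandwidth ceiling for single-stream decoding; that ceiling is an illustrative assumption (every weight byte read once per token, perfect scaling across the four cards, no KV-cache or PCIe synchronization cost), not a measured or vendor-stated number.

```python
CARDS = 4
INT8_TOPS_PER_CARD = 362       # NVIDIA datasheet value quoted above
BANDWIDTH_GB_S_PER_CARD = 864  # per-card memory bandwidth quoted above

aggregate_tops = CARDS * INT8_TOPS_PER_CARD
aggregate_bandwidth_gb_s = CARDS * BANDWIDTH_GB_S_PER_CARD


def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# Assumption: a dense 70B model in bf16 occupies about 70e9 x 2 bytes = 140 GB.
llama70b_bf16_gb = 70 * 2
ceiling = decode_ceiling_tok_s(llama70b_bf16_gb, aggregate_bandwidth_gb_s)

print(f"Aggregate INT8: {aggregate_tops} TOPS")
print(f"Aggregate memory bandwidth: {aggregate_bandwidth_gb_s} GB/s")
print(f"Bandwidth-bound decode ceiling, 70B bf16: ~{ceiling:.1f} tok/s")
```

Real single-stream throughput lands below this ceiling because tensor-parallel synchronization runs over PCIe rather than NVLink.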

Not ideal for

  • Training large models from scratch (no NVLink, limited FP8 tensor compute)
  • Single-user budget inference (4x L4 or 2x L40 is materially cheaper)
  • Dense bf16 70B at very long context on one model — prefer 2x RTX Pro 6000 Server Edition (same 192 GB pool, less TP overhead)

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

NVIDIA OEM 3-year warranty on the L40 cards, plus Kentino integration warranty. The build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability and is confirmed at order.

Recommended add-ons

  • Upgrade RAM to 512 GB (add 4x 64 GB DDR4 — four DIMM slots still open)
  • 4 TB NVMe for model library staging
  • Full 24U rack cabinet with managed PDU + online UPS 5 kVA