Kentino s.r.o.

K-AI 144 Rome L4 1452TOPS — 6× NVIDIA L4 — EPYC Milan

Price: 28681.00 EUR
Availability: InStock


Taxes included. Shipping is calculated at checkout.


K-AI 144 Rome L4 1452TOPS

144 GB VRAM Silent Edge Inference Server
6x NVIDIA L4 Passive | EPYC Milan | 1 452 TOPS INT8

1 452 INT8 TOPS | 144 GB VRAM pool | 432 W GPU envelope | Silent, passive GPUs

Six passive L4 datacenter cards. Quietest AI server in Kentino's lineup — acceptable for office-edge deployment.

A 4U single-socket inference server with six passive NVIDIA L4 cards (24 GB each, 144 GB pool), one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4 ECC, a 2 TB NVMe boot drive, and a single 2 kW ATX PSU that leaves about 62 % headroom at full load. A dense-edge inference workhorse for embedding fleets, multi-tenant small and mid-size LLM serving, and deployments near office space where watts per query matter.

Hardware

Component	Detail
GPUs	6x NVIDIA L4 24 GB (Ada Lovelace, passive, 72 W, single-slot LP, PCIe Gen4 x16)
VRAM pool	144 GB aggregate across 6 cards
CPU	AMD EPYC 7643 Milan (48C/96T, 225 W, 128 PCIe 4.0 lanes)
Motherboard	ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM	384 GB DDR4-2666 ECC RDIMM (6x 64 GB)
Boot / storage	2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply	1x 2 kW ATX PSU
Chassis	4U rack-mount (6-card layout)
Cooling	SP3 tower cooler + front-to-back directed airflow (industrial fans)
Network	Onboard dual 10 GbE (Intel X550)

Power envelope

GPU draw: 6 x 72 W = 432 W
System total at full load: ~757 W
PSU total: 2 000 W — 62 % headroom
Silent operation, massive thermal margin
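
The headroom figure follows from simple arithmetic on the numbers above. A minimal sketch, assuming the ~757 W full-load estimate is accurate (it is a quoted estimate, not a measurement); the variable and function names are illustrative only:

```python
# Power budget sketch using the figures quoted in this section.
GPU_COUNT = 6
GPU_TDP_W = 72             # per-card board power from the hardware table
SYSTEM_FULL_LOAD_W = 757   # quoted full-load estimate (approximate)
PSU_RATED_W = 2000


def headroom_fraction(load_w: float, psu_w: float) -> float:
    """Fraction of PSU capacity left unused at the given load."""
    return 1.0 - load_w / psu_w


gpu_draw_w = GPU_COUNT * GPU_TDP_W
# Everything that is not a GPU: CPU, RAM, fans, storage, conversion losses.
non_gpu_w = SYSTEM_FULL_LOAD_W - gpu_draw_w
headroom = headroom_fraction(SYSTEM_FULL_LOAD_W, PSU_RATED_W)

print(f"GPU draw: {gpu_draw_w} W")
print(f"Non-GPU share of full load: {non_gpu_w} W")
print(f"PSU headroom: {headroom:.0%}")
```

The same function can be reused to see how much headroom would remain after adding a NIC or extra storage, by raising the load figure.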

Lane topology

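The L4 is a PCIe Gen4 x16 card, and each one sits in its own full-width slot, so no card gives up host bandwidth. The ROMED8-2T provides 7x x16 slots: six hold the GPUs and one stays free for an add-in NIC. No PCIe switch is required. The L4 has no NVLink, so card-to-card traffic goes over PCIe.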
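
The lane budget can be checked with a few lines of arithmetic. This is a simplified sketch that budgets a full x16 slot per GPU plus the x4 boot drive, and it ignores lanes consumed by onboard devices such as the 10 GbE controller; the names are illustrative only:

```python
# PCIe lane budget sketch for the single-socket platform described above.
CPU_LANES = 128    # EPYC 7643, PCIe 4.0
SLOT_COUNT = 7     # ROMED8-2T x16 slots
SLOT_WIDTH = 16    # lanes budgeted per occupied slot
GPU_COUNT = 6
NVME_LANES = 4     # boot drive, PCIe 4.0 x4


def lane_budget(gpus: int, lanes_per_gpu: int, extra_lanes: int = 0):
    """Return (lanes used, lanes spare) against the CPU's lane count."""
    used = gpus * lanes_per_gpu + extra_lanes
    return used, CPU_LANES - used


used, spare = lane_budget(GPU_COUNT, SLOT_WIDTH, NVME_LANES)
free_slots = SLOT_COUNT - GPU_COUNT

print(f"Lanes used: {used} of {CPU_LANES}, spare: {spare}")
print(f"Free x16 slots: {free_slots}")
```

Under these assumptions the spare lanes exceed one full x16 slot, which is consistent with keeping a slot free for a NIC without a PCIe switch.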

What you can run

At 144 GB aggregate across 6 physical cards, the sweet spot is concurrent multi-model serving: run a 70B dense at Q4, a 30B MoE, a 14B coder, a VLM and an embedding model concurrently and still have KV headroom.
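
As a rough way to sanity-check a mix like the one above, weight memory can be approximated as parameters times bits per weight divided by eight. The sketch below assumes about 4.85 bits per weight for Q4 (typical of a Q4_K_M-style quantization) and hypothetical sizes for the VLM and the embedding model; none of these values come from this page. It ignores KV cache and runtime overhead, and it treats the pool as one space even though it is six separate 24 GB cards.

```python
# Rough weight-memory estimate for a concurrent model mix.
POOL_GB = 144
Q4_BITS = 4.85     # assumption: effective bits per weight at Q4
BF16_BITS = 16


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (decimal) for a model."""
    return params_billion * bits_per_weight / 8


# (label, parameters in billions, bits per weight)
# The VLM and embedder sizes are hypothetical placeholders.
mix = [
    ("70B dense, Q4", 70, Q4_BITS),
    ("30B MoE, Q4", 30, Q4_BITS),
    ("14B coder, Q4", 14, Q4_BITS),
    ("8B VLM, bf16 (hypothetical)", 8, BF16_BITS),
    ("0.6B embedder, bf16 (hypothetical)", 0.6, BF16_BITS),
]

total_gb = sum(weights_gb(p, b) for _, p, b in mix)
remaining_gb = POOL_GB - total_gb

for label, p, b in mix:
    print(f"{label}: {weights_gb(p, b):.1f} GB")
print(f"Total weights: {total_gb:.1f} GB, left for KV cache and overhead: {remaining_gb:.1f} GB")
```

Because each card holds at most 24 GB, a real placement also has to split the larger models across cards; the pooled total is only an upper bound on what fits.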

LLMs — text / reasoning / coding

Chinese frontier

Qwen3 / Qwen3.5 (Alibaba): Qwen3-30B-A3B Q4-Q6; QwQ-32B Q6; Qwen3-32B dense Q6; Qwen3.5-122B-A10B Q4-Q5 (~75 GB, comfortable); Qwen3-235B-A22B Q3 (~112 GB, tight fit, short context only)
DeepSeek / ByteDance Seed: DeepSeek-R2 32B sparse MoE Q4-Q6 (single-card capable, 6x concurrent streams, ~15-20 tok/s per stream); Seed-OSS-36B Q4-Q6 with 512k native context
GLM / Z.ai and Tencent: GLM-4.5-Air Q4-Q5 (60-70 GB, comfortable); Hunyuan-A13B Q4-Q6 (~48 GB)
Baidu / StepFun: ERNIE-4.5-47B-A3B Q4; Step-3.5-Flash Q3-Q4 with some spill into system RAM

Western frontier

Meta Llama: Llama 3.3 70B Q4-Q6 (43-58 GB) with generous KV (~10-17 tok/s single-stream across 6x L4 tensor-parallel); Llama 4 Scout 109B/17B MoE Q4 (~63 GB) comfortable
Mistral: Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) at bf16 (~50-65 tok/s per L4 card); Mixtral 8x22B Q4
OpenAI (open weights): gpt-oss-120b MXFP4 native (~80 GB) with room to spare; gpt-oss-20b MXFP4
Google Gemma 3 / Microsoft Phi: Gemma 3 27B bf16; Phi-4 14B bf16
NVIDIA Nemotron / Mistral Pixtral: Llama-3.1-Nemotron Super 49B Q4-Q6; Pixtral 12B / Pixtral Large Q4 (~72 GB)

Vision-Language Models

Qwen3-VL-8B/32B, Qwen3-VL-30B-A3B MoE, InternVL3 up to 78B Q4 (~48 GB), InternVL3.5-38B, DeepSeek-VL2, Llama 3.2 11B Vision bf16, Llama 3.2 90B Vision Q4 (~52 GB), Molmo 72B Q4, Gemma 3 12B/27B multimodal, MiniCPM-V 2.6 / MiniCPM-o 2.6, GLM-4.6V-Flash.

Image generation

FLUX.1 [dev] / [schnell] fp8 (~20-35 s/image on single L4 at fp8); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large (18 GB fp16 / 11 GB fp8); SDXL 1.0; HunyuanImage-2.1 (~34 GB bf16); HunyuanDiT; Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma.

Video generation

Wan 2.2 T2V-A14B / I2V-A14B MoE (tight at bf16 ~54 GB); Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B Q4-Q8 (~30 GB); HunyuanVideo 1.5 (8.3B); CogVideoX-5B; Open-Sora 2.0 Q8 (~16 GB); Mochi-1 Q4 (~18 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.

Audio / Speech / TTS

ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
TTS: CosyVoice 2 / 3; Kokoro 82M; Stable Audio Open; XTTS v2; StyleTTS 2; Step-Audio-EditX
Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

6 concurrent streams of a Q4 model that fits within a single 24 GB card (one instance per card), e.g. 6x Qwen3-14B Q4 agents
Mixed fleet: Llama 3.3 70B Q4 (tensor-parallel over 2 cards) + FLUX.1 (1 card) + Whisper-turbo (1 card) + Moshi (1 card) + BGE-M3 embedder (1 card)
Embedding service at high QPS — 6x parallel embed streams of BGE-M3 / E5 / Nomic / Cohere Embed
Video transcode farm — 6x parallel NVENC/NVDEC streams
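
One common way to realize a mixed fleet like the one listed above is to pin each service to its own cards with the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below only builds the assignment plan and prints the variable for each workload; it launches nothing, and the workload names and the contiguous-index scheme are illustrative assumptions rather than Kentino's configuration.

```python
# Card assignment sketch for the mixed-fleet example (6 cards total).
TOTAL_CARDS = 6

# (workload name, number of cards), following the mixed-fleet line above.
FLEET = [
    ("llama-3.3-70b-q4", 2),   # tensor-parallel over 2 cards
    ("flux.1", 1),
    ("whisper-turbo", 1),
    ("moshi", 1),
    ("bge-m3", 1),
]


def assign_cards(fleet, total_cards):
    """Give each workload a contiguous block of GPU indices."""
    plan = {}
    next_idx = 0
    for name, n_cards in fleet:
        if next_idx + n_cards > total_cards:
            raise ValueError(f"not enough cards left for {name}")
        plan[name] = list(range(next_idx, next_idx + n_cards))
        next_idx += n_cards
    return plan


plan = assign_cards(FLEET, TOTAL_CARDS)
for name, cards in plan.items():
    visible = ",".join(str(i) for i in cards)
    print(f"CUDA_VISIBLE_DEVICES={visible}  # {name}")
```

Each printed value would be set in the environment of the corresponding server process, so that process sees only its own cards.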

Target workloads

SaaS multi-tenant LLM API — serve 20-40 concurrent users across a 24B/32B model with room for image and ASR alongside
RAG backend — query-side embedder + 70B Q4 reader + reranker, sub-second latency, 50 QPS
Video-AI pipeline — live transcode + caption + moderation on 6 parallel streams
Edge AI appliance near the office — low acoustic profile, zero datacenter dependency
Mid-tier model R&D bench — rapid iteration on 30-70B fine-tunes, one card per experiment

Reference performance

Published references: NVIDIA L4 datasheet and community benchmarks

Benchmark	Result
Per-card INT8 TOPS (dense, NVIDIA datasheet)	242 TOPS
Aggregate INT8 TOPS (6 cards)	1 452 TOPS
Llama 3.1 8B Q4 on single L4 (community)	~35-45 tok/s single-stream
BGE-M3 embedding QPS on L4 (community)	~800 QPS at 512-token input
Whisper v3 turbo realtime factor	~1.5-2x realtime per card

Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.
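
The aggregate TOPS figure is the per-card datasheet value multiplied by the card count. The sketch below reproduces it and derives TOPS per GPU watt from the 72 W board power; these are theoretical peaks, not delivered throughput, and the names are illustrative only:

```python
# Aggregate peak INT8 throughput from the per-card datasheet figure.
PER_CARD_INT8_TOPS = 242
GPU_COUNT = 6
GPU_TDP_W = 72

aggregate_tops = PER_CARD_INT8_TOPS * GPU_COUNT
gpu_power_w = GPU_TDP_W * GPU_COUNT
tops_per_watt = aggregate_tops / gpu_power_w

print(f"Aggregate INT8: {aggregate_tops} TOPS")
print(f"Peak INT8 per GPU watt: {tops_per_watt:.2f} TOPS/W")
```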

Not ideal for

Frontier 200B+ MoE at Q4+ with long context: 4x L40 or 8x RTX 4090 (192 GB pool, contiguous TP) is the right fit
Training workloads: the L4 has no NVLink and too little memory bandwidth for efficient training
Single-workload peak throughput: per-card compute is modest vs L40 / RTX Pro 6000

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

NVIDIA OEM 3-year warranty on L4 + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.