Kentino s.r.o.
K-AI 144 Rome L4 1452TOPS — 6× NVIDIA L4 — EPYC Milan
K-AI 144 Rome L4 1452TOPS
144 GB VRAM Silent Edge Inference Server
6x NVIDIA L4 Passive | EPYC Milan | 1 452 TOPS INT8
Six passive L4 datacenter cards. Quietest AI server in Kentino's lineup — acceptable for office-edge deployment.
A 4U single-socket inference server with six passive NVIDIA L4 cards (24 GB each, 144 GB pool), one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4 ECC, 2 TB NVMe boot, and a single 2 kW ATX PSU with 62 % headroom. Dense-edge inference workhorse for embedding fleets, multi-tenant small/mid-size LLM serving, and watts-per-query deployments near office space.
Hardware
| Component | Detail |
|---|---|
| GPUs | 6x NVIDIA L4 24 GB (Ada Lovelace, passive, 72 W, single-slot LP, PCIe Gen4 x16) |
| VRAM pool | 144 GB aggregate across 6 cards |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128 PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 384 GB DDR4-2666 ECC RDIMM (6x 64 GB) |
| Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | 1x 2 kW ATX PSU |
| Chassis | 4U rack-mount (6-card layout) |
| Cooling | SP3 tower cooler + front-to-back directed airflow (industrial fans) |
| Network | Onboard dual 10 GbE (Intel X550) |
Power envelope
- GPU draw: 6 x 72 W = 432 W
- System total at full load: ~757 W
- PSU total: 2 000 W — 62 % headroom
- Silent operation, massive thermal margin
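The headroom figure follows directly from the numbers above. A minimal sketch of the arithmetic, using only the values listed in this section (the function and key names are illustrative, not part of any Kentino tooling):

```python
# Power-budget arithmetic from the figures quoted in this listing.
GPU_COUNT = 6
GPU_TDP_W = 72             # per-card board power from the hardware table
SYSTEM_FULL_LOAD_W = 757   # vendor's approximate full-load system draw
PSU_RATED_W = 2000


def power_budget(gpu_count, gpu_tdp_w, system_load_w, psu_rated_w):
    """Return GPU draw, remaining (non-GPU) draw, and PSU headroom in percent."""
    gpu_draw_w = gpu_count * gpu_tdp_w
    non_gpu_w = system_load_w - gpu_draw_w
    headroom_pct = round((1 - system_load_w / psu_rated_w) * 100)
    return {
        "gpu_draw_w": gpu_draw_w,
        "non_gpu_w": non_gpu_w,
        "headroom_pct": headroom_pct,
    }


budget = power_budget(GPU_COUNT, GPU_TDP_W, SYSTEM_FULL_LOAD_W, PSU_RATED_W)
print(budget)
```

The difference between system load and GPU draw (about 325 W) is what the listing implicitly budgets for the CPU, memory, storage, and fans.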
Lane topology
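Each L4 gets its own slot on the ROMED8-2T with no PCIe switch in the path, so no card shares host bandwidth with another. The board provides 7x x16 slots, which leaves one free for a NIC upsell. The L4 has no NVLink, so all card-to-card traffic goes over PCIe.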
What you can run
With 144 GB aggregate across 6 physical cards, the sweet spot is multi-model serving: run a 70B dense model at Q4, a 30B MoE, a 14B coder, a VLM, and an embedding model concurrently and still have KV-cache headroom.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3 / Qwen3.5 (Alibaba): Qwen3-30B-A3B Q4-Q6; QwQ-32B Q6; Qwen3-32B dense Q6; Qwen3.5-122B-A10B Q4-Q5 (~75 GB comfortable); Qwen3-235B-A22B Q3 (~112 GB) tight, short ctx
- DeepSeek / ByteDance Seed: DeepSeek-R2 32B sparse MoE Q4-Q6 (single-card capable, 6x concurrent streams, ~15-20 tok/s per stream); Seed-OSS-36B Q4-Q6 with 512k native context
- GLM (Z.ai) / Tencent Hunyuan: GLM-4.5-Air Q4-Q5 (60-70 GB comfortable); Hunyuan-A13B Q4-Q6 (~48 GB)
- Baidu / StepFun: ERNIE-4.5-47B-A3B Q4; Step-3.5-Flash Q3-Q4 with some RAM spill
Western frontier
- Meta Llama: Llama 3.3 70B Q4-Q6 (43-58 GB) with generous KV (~10-17 tok/s single-stream across 6x L4 tensor-parallel); Llama 4 Scout 109B/17B MoE Q4 (~63 GB) comfortable
- Mistral: Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) at bf16 (~50-65 tok/s per L4 card); Mixtral 8x22B Q4
- OpenAI (open weights): gpt-oss-120b MXFP4 native (~80 GB) with room to spare; gpt-oss-20b MXFP4
- Google / Microsoft: Gemma 3 27B bf16; Phi-4 14B bf16
- NVIDIA Nemotron / Mistral Pixtral: Llama-3.1-Nemotron Super 49B Q4-Q6; Pixtral 12B / Pixtral Large Q4 (~72 GB)
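The GB figures in these lists can be sanity-checked with a rough weight-footprint estimate: parameter count times average bits per weight, divided by 8. The sketch below is an approximation under stated assumptions, not a sizing tool. The bits-per-weight values are assumed averages for common llama.cpp GGUF quant types, the 70.6B parameter count for Llama 3.3 70B is likewise an assumption, and KV cache and runtime overhead are only represented by a caller-supplied reserve.

```python
# Rough weight-footprint estimate: params * bits_per_weight / 8.
# ASSUMPTION: approximate average bits per weight for llama.cpp GGUF quants.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "bf16": 16.0}

CARD_GB = 24
POOL_GB = 6 * CARD_GB  # 144 GB aggregate, as stated in the listing


def weight_gb(params_billion, quant):
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion, quant, kv_reserve_gb=0.0, budget_gb=POOL_GB):
    """True if the weights plus a KV/overhead reserve fit inside the budget."""
    return weight_gb(params_billion, quant) + kv_reserve_gb <= budget_gb


# ASSUMPTION: Llama 3.3 70B taken as 70.6B parameters.
for quant in ("Q4_K_M", "Q6_K"):
    print(f"Llama 3.3 70B {quant}: {weight_gb(70.6, quant):.1f} GB")
```

Under these assumptions the estimate lands at roughly 43 GB for Q4 and 58 GB for Q6, in line with the 43-58 GB range quoted for Llama 3.3 70B. Passing `budget_gb=48` shows why the Q4 build can sit on two cards while the Q6 build needs a third.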
Vision-Language Models
Qwen3-VL-8B/32B, Qwen3-VL-30B-A3B MoE, InternVL3 up to 78B Q4 (~48 GB), InternVL3.5-38B, DeepSeek-VL2, Llama 3.2 11B Vision bf16, Llama 3.2 90B Vision Q4 (~52 GB), Molmo 72B Q4, Gemma 3 12B/27B multimodal, MiniCPM-V 2.6 / MiniCPM-o 2.6, GLM-4.6V-Flash.
Image generation
FLUX.1 [dev] / [schnell] fp8 (~20-35 s/image on single L4 at fp8); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large (18 GB fp16 / 11 GB fp8); SDXL 1.0; HunyuanImage-2.1 (~34 GB bf16); HunyuanDiT; Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma.
Video generation
Wan 2.2 T2V-A14B / I2V-A14B MoE (tight at bf16 ~54 GB); Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B Q4-Q8 (~30 GB); HunyuanVideo 1.5 (8.3B); CogVideoX-5B; Open-Sora 2.0 Q8 (~16 GB); Mochi-1 Q4 (~18 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.
Audio / Speech / TTS
- ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2 / 3; Kokoro 82M; Stable Audio Open; XTTS v2; StyleTTS 2; Step-Audio-EditX
- Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
- Music / SFX / speech translation: MusicGen / AudioGen / Bark; SeamlessM4T v2
Multi-model / multi-tenant serving
- 6 concurrent streams of a Q4 model that fits within a single card's 24 GB (one instance per card): e.g. 6x Qwen3-14B Q4 agents
- Mixed fleet: Llama 3.3 70B Q4 (tensor-parallel over 2 cards) + FLUX.1 (1 card) + Whisper-turbo (1 card) + Moshi (1 card) + BGE-M3 embedder (1 card)
- Embedding service at high QPS — 6x parallel embed streams of BGE-M3 / E5 / Nomic / Cohere Embed
- Video transcode farm — 6x parallel NVENC/NVDEC streams
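One common way to keep such a mixed fleet from colliding is to give each service its own contiguous set of cards and expose only those cards to its process through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below lays out the mixed-fleet example from the list above. The service names and the `assign_cards` helper are hypothetical, and how each serving stack is actually launched is left out.

```python
# Map each service in the mixed-fleet example to a CUDA_VISIBLE_DEVICES value.
# Service names are hypothetical labels; card counts follow the listing.
FLEET = [
    ("llama-3.3-70b-q4", 2),  # tensor-parallel over 2 cards
    ("flux.1", 1),
    ("whisper-turbo", 1),
    ("moshi", 1),
    ("bge-m3", 1),
]
TOTAL_CARDS = 6


def assign_cards(fleet, total_cards):
    """Assign contiguous card indices to services, in order.

    Returns {service_name: "i,j,..."} suitable for CUDA_VISIBLE_DEVICES.
    Raises ValueError if the fleet needs more cards than are installed.
    """
    plan = {}
    next_card = 0
    for name, cards_needed in fleet:
        if next_card + cards_needed > total_cards:
            raise ValueError(f"not enough cards for {name}")
        ids = range(next_card, next_card + cards_needed)
        plan[name] = ",".join(str(i) for i in ids)
        next_card += cards_needed
    return plan


plan = assign_cards(FLEET, TOTAL_CARDS)
for name, devices in plan.items():
    print(f"CUDA_VISIBLE_DEVICES={devices}  # {name}")
```

Each value would then be set in the environment of the matching service before it starts, so that process sees only its own cards, renumbered from 0.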
Target workloads
- SaaS multi-tenant LLM API — serve 20-40 concurrent users across a 24B/32B model with room for image and ASR alongside
- RAG backend — query-side embedder + 70B Q4 reader + reranker, sub-second latency, 50 QPS
- Video-AI pipeline — live transcode + caption + moderation on 6 parallel streams
- Edge AI appliance near the office — low acoustic profile, zero datacenter dependency
- Mid-tier model R&D bench — rapid iteration on 30-70B fine-tunes, one card per experiment
Reference performance
Published references | NVIDIA L4 datasheet + community benchmarks
| Benchmark | Result |
|---|---|
| Per-card INT8 TOPS (NVIDIA datasheet) | 242 TOPS |
| Aggregate INT8 TOPS (6 cards) | 1 452 TOPS |
| Llama 3.1 8B Q4 on single L4 (community) | ~35-45 tok/s single-stream |
| BGE-M3 embedding QPS on L4 (community) | ~800 QPS at 512-token input |
| Whisper v3 turbo realtime factor | ~1.5-2x realtime per card |
Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.
Not ideal for
- Frontier 200B+ MoE at Q4+ with long context: 4x L40 or 8x RTX 4090 (192 GB pool, contiguous TP) is the better fit
- Training workloads: the L4 lacks the memory bandwidth and NVLink interconnect for efficient training
- Single-workload peak throughput: per-card compute is modest vs L40 / RTX Pro 6000
Warranty and lead time
NVIDIA OEM 3-year warranty on L4 + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.
Recommended add-ons
- 4 TB NVMe upgrade for model library staging
- 24U open rack cabinet with managed PDU
