Kentino s.r.o.

K-AI 144 Rome L4 1452TOPS — 6× NVIDIA L4 — EPYC Milan

Regular price: €28,681.00 EUR (sold out)
Taxes included. Shipping is calculated at checkout.

K-AI 144 Rome L4 1452TOPS

144 GB VRAM Silent Edge Inference Server
6x NVIDIA L4 Passive | EPYC Milan | 1 452 TOPS INT8

1 452 INT8 TOPS
144 GB VRAM pool
432 W GPU envelope
Silent passive GPUs

Six passive L4 datacenter cards. Quietest AI server in Kentino's lineup — acceptable for office-edge deployment.

A 4U single-socket inference server with six passive NVIDIA L4 cards (24 GB each, 144 GB pool), one AMD EPYC 7643 Milan CPU (48C/96T), 384 GB DDR4 ECC, 2 TB NVMe boot, and a single 2 kW ATX PSU with about 62 % headroom at full load. A dense edge-inference workhorse for embedding fleets, multi-tenant small and mid-size LLM serving, and deployments near office space where watts per query matter.

Hardware

Component | Detail
GPUs | 6x NVIDIA L4 24 GB (Ada Lovelace, passive, 72 W, single-slot LP, PCIe Gen4 x16)
VRAM pool | 144 GB aggregate across 6 cards
CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128 PCIe 4.0 lanes)
Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM | 384 GB DDR4-2666 ECC RDIMM (6x 64 GB)
Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | 1x 2 kW ATX PSU
Chassis | 4U rack-mount (6-card layout)
Cooling | SP3 tower cooler + front-to-back directed airflow (industrial fans)
Network | Onboard dual 10 GbE (Intel X550)

Power envelope

  • GPU draw: 6 x 72 W = 432 W
  • System total at full load: ~757 W
  • PSU total: 2 000 W — 62 % headroom
  • Silent operation, massive thermal margin
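
The headroom figure follows directly from the numbers in this list. A minimal sketch of the arithmetic, using only the quoted values (the variable names are illustrative, and the split into GPU and non-GPU draw is simply the difference between the two quoted totals):

```python
# Figures quoted in the power envelope list above.
GPU_COUNT = 6
GPU_TDP_W = 72
SYSTEM_FULL_LOAD_W = 757   # approximate full-load total, as quoted
PSU_CAPACITY_W = 2000

gpu_draw_w = GPU_COUNT * GPU_TDP_W
# Everything that is not a GPU: CPU, RAM, fans, storage, conversion losses.
non_gpu_draw_w = SYSTEM_FULL_LOAD_W - gpu_draw_w

psu_load_fraction = SYSTEM_FULL_LOAD_W / PSU_CAPACITY_W
headroom_fraction = 1.0 - psu_load_fraction

print(f"GPU draw: {gpu_draw_w} W")
print(f"Non-GPU draw: ~{non_gpu_draw_w} W")
print(f"PSU load: {psu_load_fraction:.0%}, headroom: {headroom_fraction:.0%}")
```

At the quoted full load the PSU runs at roughly 38 % of its rating, which is where the 62 % headroom comes from.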

Lane topology

The L4 is a PCIe Gen4 x16 card, and the ROMED8-2T provides 7x PCIe 4.0 x16 slots, so each card gets a full-width link to the host. Six slots are used by GPUs; one is left free for a NIC upsell. No PCIe switch required. No NVLink.
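
A rough lane budget shows why no PCIe switch is needed. This is a sketch under stated assumptions: it counts a full x16 slot for every GPU and for the spare slot (the worst case for the CPU), adds the x4 boot drive, and ignores onboard devices and any slot/M.2 lane sharing on the board.

```python
# Values from the hardware table; the budget model itself is an assumption.
CPU_LANES = 128      # EPYC 7643 PCIe 4.0 lanes
SLOT_COUNT = 7       # ROMED8-2T PCIe 4.0 x16 slots
SLOT_WIDTH = 16
GPU_COUNT = 6
NVME_LANES = 4       # boot drive, PCIe 4.0 x4


def lanes_needed(gpu_count, lanes_per_gpu, spare_slot_lanes=0):
    """Total CPU lanes consumed by GPUs, the boot drive and a spare slot."""
    return gpu_count * lanes_per_gpu + NVME_LANES + spare_slot_lanes


# Worst case: every GPU on a full x16 link, plus an x16 NIC in the free slot.
worst_case = lanes_needed(GPU_COUNT, SLOT_WIDTH, spare_slot_lanes=SLOT_WIDTH)
free_slots = SLOT_COUNT - GPU_COUNT

print(f"Lanes needed (worst case): {worst_case} of {CPU_LANES}")
print(f"Free x16 slots: {free_slots}")
```

Even in this worst case the total stays below the CPU's 128 lanes, so every card can be attached directly to the host.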

What you can run

At 144 GB aggregate across 6 physical cards, the sweet spot is concurrent multi-model serving: run a 70B dense at Q4, a 30B MoE, a 14B coder, a VLM and an embedding model concurrently and still have KV headroom.
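
To see how such a mix fits into the pool, weight size can be estimated as parameters times bits per weight. The sketch below is illustrative only: the 4.85 bits per weight for Q4 is an assumed average for a typical 4-bit quantization, the VLM and embedder sizes are hypothetical stand-ins, and KV cache, activations and per-card fragmentation are not modeled.

```python
POOL_GB = 6 * 24  # six 24 GB cards


def weights_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


Q4_BITS = 4.85   # assumption: average bits per weight for a Q4 quantization
BF16_BITS = 16

# Hypothetical fleet matching the mix described above.
fleet = {
    "70B dense, Q4": weights_gb(70, Q4_BITS),
    "30B MoE, Q4": weights_gb(30, Q4_BITS),
    "14B coder, Q4": weights_gb(14, Q4_BITS),
    "8B VLM, bf16": weights_gb(8, BF16_BITS),
    "0.6B embedder, fp16": weights_gb(0.6, BF16_BITS),
}

total_gb = sum(fleet.values())
remaining_gb = POOL_GB - total_gb

for name, size in fleet.items():
    print(f"{name:22s} {size:6.1f} GB")
print(f"Weights total: {total_gb:.1f} GB, left for KV cache: {remaining_gb:.1f} GB")
```

Under these assumptions the weights take well under two thirds of the pool. In practice each model still has to fit on whole cards, so the 70B model spans at least two of them.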

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3 / Qwen3.5 (Alibaba): Qwen3-30B-A3B Q4-Q6; QwQ-32B Q6; Qwen3-32B dense Q6; Qwen3.5-122B-A10B Q4-Q5 (~75 GB comfortable); Qwen3-235B-A22B Q3 (~112 GB) tight, short ctx
  • DeepSeek / ByteDance: DeepSeek-R2 32B sparse MoE Q4-Q6 (single-card capable, 6x concurrent streams, ~15-20 tok/s per stream); ByteDance Seed-OSS-36B Q4-Q6 with 512k native context
  • GLM / Z.ai / Tencent: GLM-4.5-Air Q4-Q5 (60-70 GB comfortable); Tencent Hunyuan-A13B Q4-Q6 (~48 GB)
  • Baidu / StepFun: ERNIE-4.5-47B-A3B Q4; StepFun Step-3.5-Flash Q3-Q4 with some RAM spill

Western frontier

  • Meta Llama: Llama 3.3 70B Q4-Q6 (43-58 GB) with generous KV (~10-17 tok/s single-stream across 6x L4 tensor-parallel); Llama 4 Scout 109B/17B MoE Q4 (~63 GB) comfortable
  • Mistral: Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) at bf16 (~50-65 tok/s per L4 card); Mixtral 8x22B Q4
  • OpenAI (open weights): gpt-oss-120b MXFP4 native (~80 GB) with room to spare; gpt-oss-20b MXFP4
  • Google / Microsoft: Gemma 3 27B bf16; Microsoft Phi-4 14B bf16
  • NVIDIA Nemotron / Mistral Pixtral: Llama-3.1-Nemotron Super 49B Q4-Q6; Pixtral 12B / Pixtral Large Q4 (~72 GB)

Vision-Language Models

Qwen3-VL-8B/32B, Qwen3-VL-30B-A3B MoE, InternVL3 up to 78B Q4 (~48 GB), InternVL3.5-38B, DeepSeek-VL2, Llama 3.2 11B Vision bf16, Llama 3.2 90B Vision Q4 (~52 GB), Molmo 72B Q4, Gemma 3 12B/27B multimodal, MiniCPM-V 2.6 / MiniCPM-o 2.6, GLM-4.6V-Flash.

Image generation

FLUX.1 [dev] / [schnell] at fp8 (~20-35 s/image on a single L4); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large (18 GB fp16 / 11 GB fp8); SDXL 1.0; HunyuanImage-2.1 (~34 GB bf16); HunyuanDiT; Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma.

Video generation

Wan 2.2 T2V-A14B / I2V-A14B MoE (tight at bf16 ~54 GB); Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B Q4-Q8 (~30 GB); HunyuanVideo 1.5 (8.3B); CogVideoX-5B; Open-Sora 2.0 Q8 (~16 GB); Mochi-1 Q4 (~18 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.

Audio / Speech / TTS

  • ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
  • TTS: CosyVoice 2 / 3; Kokoro 82M; Stable Audio Open; XTTS v2; StyleTTS 2; Step-Audio-EditX
  • Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
  • Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

  • 6 concurrent streams of a Q4 model that fits within one 24 GB card (one instance per card): e.g. 6x Qwen3-14B Q4 agents
  • Mixed fleet: Llama 3.3 70B Q4 (tensor-parallel over 2 cards) + FLUX.1 (1 card) + Whisper-turbo (1 card) + Moshi (1 card) + BGE-M3 embedder (1 card)
  • Embedding service at high QPS — 6x parallel embed streams of BGE-M3 / E5 / Nomic / Cohere Embed
  • Video transcode farm — 6x parallel NVENC/NVDEC streams
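
One common way to pin each service to its own card(s) is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below only plans the assignment for the mixed fleet listed above; the `Workload` class, the `assign_devices` helper and the workload names are hypothetical, and launching the actual servers is left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cards: int  # number of L4 cards the workload needs


def assign_devices(workloads, total_cards=6):
    """Give each workload a contiguous block of GPU indices."""
    plan = {}
    next_free = 0
    for w in workloads:
        if next_free + w.cards > total_cards:
            raise ValueError(f"not enough cards left for {w.name}")
        ids = range(next_free, next_free + w.cards)
        plan[w.name] = ",".join(str(i) for i in ids)
        next_free += w.cards
    return plan


# The mixed fleet from the list above: 2 + 1 + 1 + 1 + 1 = 6 cards.
mixed_fleet = [
    Workload("llama-3.3-70b-q4", 2),
    Workload("flux.1", 1),
    Workload("whisper-turbo", 1),
    Workload("moshi", 1),
    Workload("bge-m3", 1),
]

plan = assign_devices(mixed_fleet)
for name, devices in plan.items():
    print(f"CUDA_VISIBLE_DEVICES={devices}  # {name}")
```

Each printed line is the environment setting a given service would be started with, so no process sees a card that belongs to another tenant.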

Target workloads

  • SaaS multi-tenant LLM API — serve 20-40 concurrent users across a 24B/32B model with room for image and ASR alongside
  • RAG backend — query-side embedder + 70B Q4 reader + reranker, sub-second latency, 50 QPS
  • Video-AI pipeline — live transcode + caption + moderation on 6 parallel streams
  • Edge AI appliance near the office — low acoustic profile, zero datacenter dependency
  • Mid-tier model R&D bench — rapid iteration on 30-70B fine-tunes, one card per experiment

Measured performance

Published references | NVIDIA L4 datasheet + community benchmarks

Benchmark | Result
Per-card INT8 TOPS (NVIDIA datasheet) | 242 TOPS
Aggregate INT8 TOPS (6 cards) | 1 452 TOPS
Llama 3.1 8B Q4 on single L4 (community) | ~35-45 tok/s single-stream
BGE-M3 embedding QPS on L4 (community) | ~800 QPS at 512-token input
Whisper v3 turbo realtime factor | ~1.5-2x realtime per card

Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.
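
As a sanity check on these references, the aggregate TOPS figure is a straight multiplication, and single-stream decode speed can be bounded by memory bandwidth. The sketch below assumes the L4 datasheet bandwidth of 300 GB/s, an average of 4.85 bits per weight for Q4, and that every generated token reads all weights once; it is an upper bound, not a prediction.

```python
CARDS = 6
PER_CARD_INT8_TOPS = 242        # from the table above
L4_MEM_BANDWIDTH_GBS = 300.0    # NVIDIA L4 datasheet memory bandwidth

aggregate_tops = CARDS * PER_CARD_INT8_TOPS


def decode_ceiling_tok_s(params_billion, bits_per_weight,
                         bandwidth_gbs=L4_MEM_BANDWIDTH_GBS):
    """Bandwidth-bound ceiling: each token streams all weights once."""
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gbs / weights_gb


# Llama 3.1 8B at an assumed 4.85 bits per weight (Q4).
ceiling = decode_ceiling_tok_s(8, 4.85)

print(f"Aggregate INT8: {aggregate_tops} TOPS")
print(f"8B Q4 single-stream ceiling on one L4: ~{ceiling:.0f} tok/s")
```

The community figure of ~35-45 tok/s sits below this ceiling of roughly 62 tok/s, which is what one would expect once KV-cache reads and kernel overheads are included.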

Not ideal for

  • Frontier 200B+ MoE at Q4+ with long context — 4x L40 or 8x RTX 4090 (192 GB pool, contiguous TP) is the right fit
  • Training workloads: the L4 has no NVLink and limited memory bandwidth (300 GB/s per card), so multi-card training is inefficient
  • Single-workload peak throughput — per-card compute is modest vs L40 / RTX Pro 6000

Warranty and lead time

2 years parts warranty
1 year labor warranty
10-28 days lead time

NVIDIA OEM 3-year warranty on L4 + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.

Recommended add-ons

  • 4 TB NVMe upgrade for model library staging
  • 24U open rack cabinet with managed PDU