Kentino s.r.o.

K-AI 48 Rome L4 484TOPS — 2x NVIDIA L4 Passive Edge AI Server

Name: K-AI 48 Rome L4 484TOPS — 2x NVIDIA L4 Passive Edge AI Server
Brand: Kentino s.r.o.
Price: 13100.00 EUR
Availability: InStock

Shipping is calculated at checkout.

K-AI 48 Rome L4 484TOPS

Silent 2x L4 Passive Edge Server
48 GB ECC VRAM | EPYC Milan | 484 TOPS INT8

484 TOPS INT8 | 48 GB ECC VRAM | 144 W GPU total | 24/7 datacenter

Silent 2x L4 passive inference box — datacenter-grade warranty path, 72 W per card, 48 GB ECC VRAM for always-on edge deployment.

A 2-GPU edge inference server built around passive NVIDIA L4 cards — the datacenter-class silent option in the Kentino lineup. 48 GB total ECC VRAM, 144 W total GPU draw, single-slot card footprint, and airflow driven entirely by the chassis. For branch offices, broadcast facilities, always-on transcription, and any deployment where acoustic profile and a datacenter warranty path matter more than raw tensor throughput.

Hardware

Component	Detail
GPUs	2x NVIDIA L4 24 GB GDDR6 passive (72 W, PCIe 4.0 x16, Ada Lovelace, ECC)
VRAM pool	48 GB ECC
CPU	AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard	ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM	128 GB DDR4-2666 ECC RDIMM (2x 64 GB)
Boot / storage	1 TB NVMe M.2 (PCIe 4.0 x4)
Power supply	Single 2 kW ATX PSU
Chassis	4U rack-mount, passive Gen4 x16 risers
Cooling	SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust (low-RPM PWM)
Network	Onboard dual 10 GbE (Intel X550) + IPMI

Power envelope

GPU draw: 2 x 72 W = 144 W
System total at full load: ~469 W
PSU rating: 2000 W, leaving ~1531 W (76.55 %) of headroom at full load
Fans run at idle-to-low RPM (~35 dBA at idle, <45 dBA under sustained inference)
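
The headroom figure follows directly from the numbers above. A minimal sketch of the arithmetic, with illustrative variable names; the ~469 W full-load total is taken as given, not derived from a component breakdown:

```python
# Figures quoted in the power envelope above.
GPU_COUNT = 2
GPU_TDP_W = 72             # per passive L4 card
SYSTEM_FULL_LOAD_W = 469   # approximate full-load draw, taken as given
PSU_RATED_W = 2000

gpu_total_w = GPU_COUNT * GPU_TDP_W
headroom_w = PSU_RATED_W - SYSTEM_FULL_LOAD_W
headroom_pct = 100 * headroom_w / PSU_RATED_W

print(f"GPU draw: {gpu_total_w} W")
print(f"PSU headroom: {headroom_w} W ({headroom_pct:.2f} %)")
```

Swapping in a measured wall-power reading for SYSTEM_FULL_LOAD_W gives the headroom for a specific deployment.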

Lane topology

Both GPUs run at PCIe Gen4 x16. The L4 is natively Gen4 x16, and the ROMED8-2T routes two x16 links directly from the CPU, with no PCIe switch and no NVLink. Sustained GPU temperature is 55-65 °C; the passive cards rely entirely on chassis airflow.
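
One way to confirm the link width and temperature figures on a delivered unit is to query nvidia-smi. The sketch below assumes the NVIDIA driver is installed; the function names, the 65 C threshold and the canned sample output are illustrative, not Kentino tooling or measurements.

```python
import csv
import io
import subprocess

QUERY = "index,pcie.link.gen.current,pcie.link.width.current,temperature.gpu"

def parse_gpu_report(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        idx, gen, width, temp = (int(field) for field in rec)
        rows.append({"index": idx, "gen": gen, "width": width, "temp_c": temp})
    return rows

def check(rows, want_gen=4, want_width=16, max_temp_c=65):
    """Return human-readable warnings; defaults mirror the figures above."""
    warnings = []
    for r in rows:
        if (r["gen"], r["width"]) != (want_gen, want_width):
            warnings.append(f"GPU {r['index']}: link is Gen{r['gen']} x{r['width']}")
        if r["temp_c"] > max_temp_c:
            warnings.append(f"GPU {r['index']}: {r['temp_c']} C exceeds {max_temp_c} C")
    return warnings

def read_gpus():
    """Query the live system; requires the NVIDIA driver and nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_report(out)

# Canned output for illustration only (not a measurement):
sample = "0, 4, 16, 58\n1, 4, 8, 67\n"
sample_warnings = check(parse_gpu_report(sample))
for w in sample_warnings:
    print(w)
```

On real hardware, call check(read_gpus()) instead of using the sample string. The driver can report a lower link generation while a GPU is idle, so the check is more meaningful under load.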

What you can run

With 48 GB of ECC VRAM across 2 passive L4 cards, this server handles always-on LLM inference, 24/7 ASR + TTS pipelines, VLM document processing, and edge deployments where silence and datacenter warranty matter.

LLMs — text / reasoning / coding

Chinese frontier

Qwen3-32B dense Q6 with 32k ctx (~15-20 tok/s single-stream on L4, published reference)
Qwen3-30B-A3B / Qwen3-Coder-30B-A3B Q4-Q6 (MoE, 256k ctx)
QwQ-32B Q6; DeepSeek-R2 32B sparse MoE Q4-Q6 (~18-24 tok/s single-stream at Q4 on L4, published reference)
Hunyuan-A13B Q6 or fp8 (~48 GB) — 80B/13B MoE, 256k ctx
Seed-OSS-36B Q4-Q6 — 512k native ctx
ERNIE-4.5-47B-A3B Q4-Q6 (~28-42 GB)

Western frontier

Llama 3.3 70B Q4_K_M (~43 GB) tensor-parallel 2-way (~8-12 tok/s single-stream on 2x L4, published reference)
Mistral Small 3 / Magistral / Devstral Small 2 (24B) bf16
Gemma 3 27B multimodal, quantized (bf16 weights are ~54 GB and exceed the 48 GB pool)
Phi-4 14B / Phi-4-reasoning bf16
Nemotron-Super 49B Q4 (~28 GB)
OLMo 2 32B / OLMo 3.1-32B-Think — fully open reasoning research
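
The size figures quoted in the lists above can be sanity-checked with a rough rule: weight footprint is about parameter count times bits per weight, divided by 8. The sketch below is an estimate only. The bits-per-weight averages, the parameter counts and the 4 GB reserve for KV cache and runtime are assumptions, and the helper names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, pool_gb=48.0, reserve_gb=4.0):
    """True if weights plus a reserve for KV cache and runtime fit in the pool.

    reserve_gb is a hypothetical allowance; real usage depends on context
    length, batch size and the inference runtime.
    """
    return weights_gb(params_billion, bits_per_weight) + reserve_gb <= pool_gb

# Assumed average bits per weight for llama.cpp K-quants (approximate).
Q4_K_M = 4.85
Q6_K = 6.56

llama70_q4 = weights_gb(70.6, Q4_K_M)  # Llama 3.3 70B, ~70.6B parameters
qwen32_q6 = weights_gb(32.8, Q6_K)     # Qwen3-32B, ~32.8B parameters

print(f"Llama 3.3 70B Q4_K_M: {llama70_q4:.1f} GB, fits: {fits(70.6, Q4_K_M)}")
print(f"Qwen3-32B Q6_K:       {qwen32_q6:.1f} GB, fits: {fits(32.8, Q6_K)}")
```

The Llama estimate lands close to the ~43 GB quoted above. Long contexts can need far more than the assumed reserve, so treat a "fits" result as a first filter, not a guarantee.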

Vision-Language

Qwen3-VL-8B / 32B Q4-Q6; InternVL3.5-38B Q4; Pixtral 12B bf16 (24 GB); Llama 3.2 11B Vision bf16; Gemma 3 12B / 27B multimodal; MiniCPM-V 2.6 / MiniCPM-o 2.6; Aya Vision 8B / 32B for 23-language VLM.

Image generation

L4 is inference-tuned — usable for steady-state image pipelines, not batch generation: FLUX.1 [dev] fp8 / Q4 — single image in 8-12 s; SD 3.5 Large fp8 / SDXL 1.0 / SD 3.5 Medium; HunyuanImage-2.1 NF4 (~14 GB); Kolors 2.0 fp8.

Video generation

Not recommended for new video projects on L4 — prefer a 4090/5090 build. For light T2V pipelines: Wan 2.2 TI2V-5B at bf16 — 5 s 720p in ~6-10 minutes; HunyuanVideo 1.5 (8.3B) Wan2GP optimization path.

Audio / Speech / TTS

The L4's real strength — 24/7 ASR + TTS + realtime voice stacks.

ASR: Whisper v3 large / turbo (~30x realtime on L4, published reference); NVIDIA Parakeet-TDT 1.1B; Canary 1B
TTS: CosyVoice 2.0 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open
Realtime / S2S: Kyutai Moshi (7B, 200 ms latency full-duplex); Step-Audio 2 mini / R1
Translation: Meta SeamlessM4T v2 (~100 languages)

Multi-model / multi-tenant

Whisper v3 + Kokoro + Moshi + Qwen3-14B Q6 all resident on card 1 (~18-20 GB); card 2 reserved for a second tenant or a VLM
8-16 concurrent ASR sessions on a single L4 at Whisper-turbo real-time
RAG endpoint: Qwen3-14B Q6 (~20-28 tok/s) or Llama 3.1 8B Q4_K_M (~30-40 tok/s), single-stream on L4 per the published references, + BGE-M3 embeddings + reranker
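
The concurrent-session figure above can be related to the real-time factor quoted in the Audio section. The sketch assumes throughput divides linearly across streams and that only a share of the GPU is available for ASR; both are simplifying assumptions, and gpu_share is a hypothetical tuning parameter, not a measured value.

```python
import math

def max_realtime_sessions(realtime_factor, gpu_share=0.5):
    """Upper bound on live streams one GPU can keep up with.

    realtime_factor: seconds of audio transcribed per second of GPU time.
    gpu_share: hypothetical fraction of the GPU left for ASR once other
    resident models, batching overhead and latency margin are accounted for.
    """
    if realtime_factor <= 0 or not 0 < gpu_share <= 1:
        raise ValueError("realtime_factor must be > 0 and gpu_share in (0, 1]")
    return math.floor(realtime_factor * gpu_share)

sessions = max_realtime_sessions(30, gpu_share=0.5)
print(sessions)
```

With the ~30x figure and half the card reserved for other work, the bound falls inside the 8-16 session range listed above. Actual capacity depends on batching and latency targets.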

Target workloads

Branch office or broadcast facility silent inference box
Always-on ASR + translation pipeline (call centers, lecture transcription, media captioning)
Edge RAG endpoint over corporate documents with datacenter warranty path
24/7 multimodal assistant (Qwen3-VL-8B + MiniCPM-o 2.6) for a small office
Development staging box for datacenter-class deployments — same L4 silicon as hyperscale edge

Published performance references

Published reference | 2x NVIDIA L4 comparable hardware

Benchmark	Result
Llama 3.1 8B Q4_K_M llama.cpp decode	~30-40 tok/s single-stream
Qwen3-14B Q6 vLLM decode	~20-28 tok/s
Whisper v3 large realtime factor	~15-20x per L4
Parakeet-TDT 1.1B English ASR	~40-60x real-time
Moshi 7B full-duplex voice	200 ms latency, fits on single L4

Published, not measured on Kentino hardware.
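
For capacity planning, a real-time factor converts directly into audio hours per day. A minimal sketch using the Whisper v3 large range from the table; the function name and the duty_cycle parameter are illustrative, and the inputs are the published figures, not Kentino measurements.

```python
def audio_hours_per_day(realtime_factor, duty_cycle=1.0):
    """Hours of audio one GPU can transcribe in 24 h at a given real-time factor.

    duty_cycle is the assumed fraction of the day the GPU spends on ASR.
    """
    return realtime_factor * 24 * duty_cycle

low, high = audio_hours_per_day(15), audio_hours_per_day(20)
print(f"Whisper v3 large: {low:.0f}-{high:.0f} audio hours per day per L4")
```

Lower duty_cycle when the same card also serves an LLM or TTS model.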

Not ideal for

70B dense at Q6 or higher (even the 48 GB pool is too tight; use 4x 4090 or 2x 5090)
Image / video generation batch work at scale (L4 tensor throughput is inference-tuned)
LoRA / fine-tuning workflows — use 4090/5090 builds instead

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

The L4 carries NVIDIA's datacenter warranty path, a meaningful advantage over consumer cards for 24/7 SLA deployments. The build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification.