Kentino s.r.o.
K-AI 48 Rome L4 484TOPS — 2x NVIDIA L4 Passive Edge AI Server
K-AI 48 Rome L4 484TOPS
Silent 2x L4 Passive Edge Server
48 GB ECC VRAM | EPYC Milan | 484 TOPS INT8
Silent 2x L4 passive inference box — datacenter-grade warranty path, 72 W per card, 48 GB ECC VRAM for always-on edge deployment.
A 2-GPU edge inference server built around passive NVIDIA L4 cards — the datacenter-class silent option in the Kentino lineup. 48 GB total ECC VRAM, 144 W total GPU draw, single-slot card footprint, and airflow driven entirely by the chassis. For branch offices, broadcast facilities, always-on transcription, and any deployment where acoustic profile and a datacenter warranty path matter more than raw tensor throughput.
Hardware
| Component | Detail |
|---|---|
| GPUs | 2x NVIDIA L4 24 GB GDDR6 passive (72 W, PCIe 4.0 x16, Ada Lovelace, ECC) |
| VRAM pool | 48 GB ECC |
| CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) |
| Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) |
| System RAM | 128 GB DDR4-2666 ECC RDIMM (2x 64 GB) |
| Boot / storage | 1 TB NVMe M.2 (PCIe 4.0 x4) |
| Power supply | Single 2 kW ATX PSU |
| Chassis | 4U rack-mount, passive Gen4 x16 risers |
| Cooling | SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust (low-RPM PWM) |
| Network | Onboard dual 10 GbE (Intel X550) + IPMI |
Power envelope
- GPU draw: 2 x 72 W = 144 W
- System total at full load: ~469 W
- PSU total: 2 000 W — 76.55 % headroom
- Low total draw keeps the fans at idle-to-low RPM (~35 dBA idle, <45 dBA under sustained inference)
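The headroom figure follows directly from the numbers in the list above. A minimal sketch of that arithmetic, using only the quoted values (the variable names are illustrative):

```python
# Power-budget arithmetic using the figures quoted in the list above.
GPU_COUNT = 2
GPU_TDP_W = 72        # per passive L4 card
SYSTEM_LOAD_W = 469   # quoted full-load system total (approximate)
PSU_RATING_W = 2000   # single ATX PSU

gpu_draw_w = GPU_COUNT * GPU_TDP_W
psu_load_fraction = SYSTEM_LOAD_W / PSU_RATING_W
headroom_pct = (1 - psu_load_fraction) * 100

print(f"GPU draw: {gpu_draw_w} W")
print(f"PSU load: {psu_load_fraction:.2%}")
print(f"Headroom: {headroom_pct:.2f} %")
```

Because the system total is itself an approximate figure, the two decimal places of headroom should be read as arithmetic precision, not measurement precision.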
Lane topology
Both GPUs run at PCIe Gen4 x16. The L4 is natively Gen4 x16, and the ROMED8-2T routes two x16 links directly from the CPU, with no PCIe switch and no NVLink. Sustained GPU temperature is 55-65 °C; the passive cards rely entirely on chassis airflow.
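For context on what a direct Gen4 x16 link provides, the sketch below works out the theoretical per-direction link rate and the share of CPU lanes the two GPUs occupy. It assumes the standard PCIe 4.0 signalling rate (16 GT/s per lane, 128b/130b encoding) and ignores protocol overhead, so real transfer rates will be lower.

```python
# Theoretical PCIe 4.0 link rate, before packet/protocol overhead.
TRANSFERS_PER_S = 16e9       # 16 GT/s per lane (PCIe 4.0)
ENCODING = 128 / 130         # 128b/130b line encoding
LANES_PER_GPU = 16
GPU_COUNT = 2
CPU_LANES = 128              # EPYC 7643, from the hardware table

per_lane_gbs = TRANSFERS_PER_S * ENCODING / 8 / 1e9   # GB/s, one direction
per_gpu_gbs = per_lane_gbs * LANES_PER_GPU
gpu_lanes_used = LANES_PER_GPU * GPU_COUNT

print(f"Per GPU link: {per_gpu_gbs:.1f} GB/s per direction")
print(f"GPU lanes used: {gpu_lanes_used} of {CPU_LANES}")
```

Since there is no NVLink, this link is also the path for GPU-to-GPU traffic in tensor-parallel setups.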
What you can run
With 48 GB of ECC VRAM across 2 passive L4 cards, this server handles always-on LLM inference, 24/7 ASR + TTS pipelines, VLM document processing, and edge deployments where silence and datacenter warranty matter.
LLMs — text / reasoning / coding
Chinese frontier
- Qwen3-32B dense Q6 with 32k ctx (~15-20 tok/s single-stream on L4, published reference)
- Qwen3-30B-A3B / Qwen3-Coder-30B-A3B Q4-Q6 (MoE, 256k ctx)
- QwQ-32B Q6; DeepSeek-R2 32B sparse MoE Q4-Q6 (~18-24 tok/s single-stream at Q4 on L4, published reference)
- Hunyuan-A13B Q6 or fp8 (~48 GB) — 80B/13B MoE, 256k ctx
- Seed-OSS-36B Q4-Q6 — 512k native ctx
- ERNIE-4.5-47B-A3B Q4-Q6 (~28-42 GB)
Western frontier
- Llama 3.3 70B Q4_K_M (~43 GB) tensor-parallel 2-way (~8-12 tok/s single-stream on 2x L4, published reference)
- Mistral Small 3 / Magistral / Devstral Small 2 (24B) bf16
- Gemma 3 27B multimodal bf16
- Phi-4 14B / Phi-4-reasoning bf16
- Nemotron-Super 49B Q4 (~28 GB)
- OLMo 2 32B / OLMo 3.1-32B-Think — fully open reasoning research
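As a rough check on the VRAM figures quoted in the lists above, weight footprint can be approximated as parameter count times bits per weight. The sketch below is an estimate only: the bits-per-weight values (about 4.85 for Q4_K_M, about 6.56 for Q6_K), the 70.6B parameter count, and the 4 GB reserve for KV cache and runtime buffers are assumptions for illustration, not Kentino measurements. `weight_gb` and `fits_pool` are hypothetical helper names.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_pool(params_billion: float, bits_per_weight: float,
              pool_gb: float = 48.0, reserve_gb: float = 4.0) -> bool:
    """True if weights plus an assumed reserve fit in the VRAM pool."""
    return weight_gb(params_billion, bits_per_weight) + reserve_gb <= pool_gb


# Assumed values: ~70.6B parameters, ~4.85 bpw (Q4_K_M), ~6.56 bpw (Q6_K).
for label, bpw in [("Q4_K_M", 4.85), ("Q6_K", 6.56)]:
    size = weight_gb(70.6, bpw)
    verdict = "fits" if fits_pool(70.6, bpw) else "does not fit"
    print(f"70B {label}: ~{size:.1f} GB, {verdict} in 48 GB")
```

Under these assumptions a 70B dense model fits the 48 GB pool at Q4_K_M but not at Q6, and the margin left for context shrinks quickly as context length grows.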
Vision-Language
Qwen3-VL-8B / 32B Q4-Q6; InternVL3.5-38B Q4; Pixtral 12B bf16 (24 GB); Llama 3.2 11B Vision bf16; Gemma 3 12B / 27B multimodal; MiniCPM-V 2.6 / MiniCPM-o 2.6; Aya Vision 8B / 32B for 23-language VLM.
Image generation
L4 is inference-tuned — usable for steady-state image pipelines, not batch generation: FLUX.1 [dev] fp8 / Q4 — single image in 8-12 s; SD 3.5 Large fp8 / SDXL 1.0 / SD 3.5 Medium; HunyuanImage-2.1 NF4 (~14 GB); Kolors 2.0 fp8.
Video generation
Not recommended for new video projects on L4; prefer a 4090/5090 build. For light T2V pipelines: Wan 2.2 TI2V-5B at bf16 (5 s of 720p in ~6-10 minutes); HunyuanVideo 1.5 (8.3B) via the Wan2GP optimization path.
Audio / Speech / TTS
The L4's real strength — 24/7 ASR + TTS + realtime voice stacks.
- ASR: Whisper v3 large / turbo (~30x realtime on L4, published reference); NVIDIA Parakeet-TDT 1.1B; Canary 1B
- TTS: CosyVoice 2.0 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open
- Realtime / S2S: Kyutai Moshi (7B, 200 ms latency full-duplex); Step-Audio 2 mini / R1
- Translation: Meta SeamlessM4T v2 (~100 languages)
Multi-model / multi-tenant
- Whisper v3 + Kokoro + Moshi + Qwen3-14B Q6 all resident on card 1 (~18-20 GB); card 2 reserved for a second tenant or a VLM
- 8-16 concurrent ASR sessions on a single L4 at Whisper-turbo real-time
- RAG endpoint: Qwen3-14B / Llama 3.1 8B (~48-72 tok/s single-stream on L4, published reference) + BGE-M3 embeddings + reranker
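The concurrent-session figure above can be related to the ASR realtime factor with a simple capacity estimate. This is a back-of-the-envelope sketch, not a benchmark: the `efficiency` derating (0.5 by default, to allow for batching overhead, latency targets, and other resident models) is an assumed value, and `max_realtime_sessions` is a hypothetical helper name.

```python
import math


def max_realtime_sessions(realtime_factor: float, efficiency: float = 0.5) -> int:
    """Estimate how many live audio streams one GPU can keep up with.

    realtime_factor: seconds of audio transcribed per second of GPU time.
    efficiency: assumed fraction of that throughput usable under
                streaming conditions (illustrative, not measured).
    """
    return math.floor(realtime_factor * efficiency)


# Realtime factors quoted on this page: ~15-20x and ~30x per L4.
for rtf in (15, 20, 30):
    print(f"{rtf}x realtime -> about {max_realtime_sessions(rtf)} live sessions")
```

Actual capacity depends on chunk size, latency budget, and what else shares the card, so it should be confirmed with a load test on the target stack.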
Target workloads
- Branch office or broadcast facility silent inference box
- Always-on ASR + translation pipeline (call centers, lecture transcription, media captioning)
- Edge RAG endpoint over corporate documents with datacenter warranty path
- 24/7 multimodal assistant (Qwen3-VL-8B + MiniCPM-o 2.6) for a small office
- Development staging box for datacenter-class deployments — same L4 silicon as hyperscale edge
Published performance references
Published references for comparable 2x NVIDIA L4 hardware:
| Benchmark | Result |
|---|---|
| Llama 3.1 8B Q4_K_M llama.cpp decode | ~30-40 tok/s single-stream |
| Qwen3-14B Q6 vLLM decode | ~20-28 tok/s |
| Whisper v3 large realtime factor | ~15-20x per L4 |
| Parakeet-TDT 1.1B English ASR | ~40-60x real-time |
| Moshi 7B full-duplex voice | 200 ms latency, fits on single L4 |
These figures are published references, not measurements on Kentino hardware.
Not ideal for
- 70B dense at Q6+ (even 48 GB pool is tight — use 4x4090 or 2x5090)
- Image / video generation batch work at scale (L4 tensor throughput is inference-tuned)
- LoRA / fine-tuning workflows — use 4090/5090 builds instead
Warranty and lead time
The L4 carries an NVIDIA datacenter warranty path, a meaningful advantage over consumer cards for 24/7 SLA deployments. The build includes assembly, BIOS configuration, driver installation, burn-in testing, and functional verification.
Recommended add-ons
- Upgrade to K-AI 96 Rome L4 968TOPS (4x L4, 96 GB pool) for doubled throughput
- Upgrade boot drive to 2 TB NVMe
- Upgrade RAM to 256 GB (4x 64 GB) for multi-model concurrent serving
- Rack PDU + 2 kVA online UPS for branch deployment
