Kentino s.r.o.

K-AI 64 Rome 5080 3600TOPS — 4x RTX 5080 Budget AI Server

Regular price €11,940.00 EUR
Sold out
Taxes included. Shipping is calculated at checkout.

K-AI 64 Rome 5080 3600TOPS

Budget 4-GPU Blackwell Server
4x RTX 5080 | EPYC Milan | 3 600 TOPS INT8

3 600 TOPS INT8 | 64 GB VRAM pool | 4 GPU Blackwell | rack ready

Kentino's budget 4-GPU Blackwell server — 64 GB VRAM pool, 3 600 aggregate TOPS INT8, lowest CZK-per-TOPS in the lineup.

A 4-GPU Blackwell inference server built around the RTX 5080 — 360 W per card, PCIe 5 silicon, 16 GB GDDR7 each. Four cards deliver a 64 GB pooled VRAM envelope and 3 600 INT8 TOPS aggregate at the best CZK-per-TOPS point Kentino offers. The entry into multi-GPU Blackwell inference: ideal for embedding clusters, 7-13B model serving at scale, image / video batch generation, and 70B Q4 tensor-parallel.

Hardware

Component | Detail
GPUs | 4x NVIDIA GeForce RTX 5080 16 GB GDDR7 (360 W, PCIe 5.0 x16)
VRAM pool | 64 GB
CPU | AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard | ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM | 256 GB DDR4-2666 ECC RDIMM (4x 64 GB)
Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | Single 2 kW ATX PSU
Chassis | 4U rack-mount, 4x GPU, passive Gen4 x16 risers, front-to-back directed airflow
Cooling | SP3 tower cooler, 3x 120 mm front intake + 1x 120 mm rear exhaust (industrial fans)
Network | Onboard dual 10 GbE (Intel X550) + IPMI

Power envelope

  • GPU draw: 4 x 360 W = 1 440 W
  • System total at full load: ~1 765 W
  • PSU total: 2 000 W (single 2 kW ATX) — 11.75 % headroom
  • Above the 10 % floor but tighter than other 4-GPU builds; dual-PSU upgrade recommended for high-duty workloads
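The headroom figure above is simple arithmetic: spare PSU capacity divided by total PSU capacity. A minimal sketch, using only the wattages quoted on this page (the function name `psu_headroom` is illustrative, not part of any vendor tooling):

```python
def psu_headroom(psu_watts: float, load_watts: float) -> float:
    """Return spare PSU capacity as a fraction of total PSU capacity."""
    if psu_watts <= 0:
        raise ValueError("psu_watts must be positive")
    return (psu_watts - load_watts) / psu_watts


GPU_COUNT = 4
GPU_WATTS = 360           # per-card power from the hardware table
SYSTEM_LOAD_WATTS = 1765  # full-load system total quoted above

gpu_draw = GPU_COUNT * GPU_WATTS
single = psu_headroom(2000, SYSTEM_LOAD_WATTS)    # stock single 2 kW PSU
dual = psu_headroom(2 * 2000, SYSTEM_LOAD_WATTS)  # dual 2 kW upgrade

print(f"GPU draw: {gpu_draw} W")
print(f"Single PSU headroom: {single:.2%}")
print(f"Dual PSU headroom: {dual:.2%}")
```

The same function can be reused to check any other PSU option against the 10 % floor.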

Lane topology

The ROMED8-2T routes four PCIe 4.0 x16 links directly from the CPU root complex, one per GPU. The RTX 5080 is PCIe 5.0 silicon but runs here at Gen4 x16, which is not a bandwidth bottleneck for inference. There is no PCIe switch and no NVLink, so tensor parallelism runs over PCIe.
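For a sense of what each GPU link offers, the theoretical one-direction PCIe bandwidth can be computed from the per-lane signalling rate and the 128b/130b line encoding. This is a rough sketch: real transfers lose a further share to protocol overhead, and whether the link is sufficient depends on the workload.

```python
# Per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130  # 128b/130b line encoding used by Gen3 and later


def pcie_gb_per_s(gen: int, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING * lanes / 8


gen4 = pcie_gb_per_s(4)  # what each RTX 5080 gets on this board
gen5 = pcie_gb_per_s(5)  # what the card could negotiate on a Gen5 host

print(f"Gen4 x16: {gen4:.1f} GB/s per direction")
print(f"Gen5 x16: {gen5:.1f} GB/s per direction")
```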

What you can run

With 64 GB of pooled VRAM across 4 Blackwell cards, this server handles 70B Q4 tensor-parallel, embedding clusters at scale, image and video batch pipelines, and 7-13B multi-tenant serving for 64-128 concurrent users.

LLMs — text / reasoning / coding

Chinese frontier

  • Qwen3-32B Q8 (dense at near-fp16 quality); Qwen3.5-27B bf16
  • Qwen3-30B-A3B / Qwen3-Coder-30B-A3B bf16 (~60 GB fits tight)
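  • Qwen3.5-122B-A10B Q4 (~70-75 GB: exceeds the 64 GB pool, so part of the model spills to DDR4 system RAM)
  • Hunyuan-A13B fp8 (~80 GB native: exceeds the 64 GB pool, prefer Q6)
  • Seed-OSS-36B bf16 (~72 GB: exceeds the 64 GB pool, needs partial offload to system RAM)
  • DeepSeek-R2 32B sparse MoE bf16 (~64 GB, fills the pool with no KV headroom); ~45-60 tok/s single-stream at Q4 on Blackwell (published reference)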
  • GLM-4.5-Air 106B/12B Q3_K (~55 GB) — tight KV headroom
  • ERNIE-4.5-47B-A3B Q4 (~28 GB with headroom for second model)

Western frontier

  • Llama 3.3 70B Q4_K_M (~43 GB): the sweet spot for this pool (~15-20 tok/s single-stream TP-4 on 4x 5080, estimated)
  • Hermes 3 70B / Tulu 3 70B Q4 — open Llama derivatives with full post-training transparency
  • Mistral Small 3 / Magistral / Devstral Small 2 24B bf16
  • Gemma 3 27B bf16 multimodal
  • Phi-4 14B / Nemotron-Super 49B Q6-Q8
  • gpt-oss-20b MXFP4 (16 GB — 4 instances on 4 cards for parallel tenants); gpt-oss-120b MXFP4 (80 GB — tight; spill manageable)
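The sizes quoted in the lists above follow from parameter count times bits per weight. The sketch below reproduces a few of them and sorts each model into "fits", "tight" or "spills" against the 64 GB pool. The bits-per-weight value for Q4_K_M is an approximate average, and `reserve_gb` is a hypothetical allowance for KV cache and runtime buffers, not a measured figure.

```python
POOL_GB = 64.0  # 4 cards x 16 GB

# Bits per weight by format. bf16 and Q8_0 are exact; Q4_K_M is an
# approximate average (assumption) because it mixes block types.
BITS_PER_WEIGHT = {"bf16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in GB for a given quantization format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


def verdict(params_billion: float, fmt: str,
            pool_gb: float = POOL_GB, reserve_gb: float = 6.0) -> str:
    """Classify a model against the VRAM pool.

    reserve_gb is a hypothetical budget for KV cache and buffers.
    """
    size = weights_gb(params_billion, fmt)
    if size > pool_gb:
        return "spills"
    if size + reserve_gb > pool_gb:
        return "tight"
    return "fits"


for name, params, fmt in [
    ("70B Q4_K_M", 70, "Q4_K_M"),
    ("32B Q8", 32, "Q8_0"),
    ("30B bf16", 30, "bf16"),
    ("36B bf16", 36, "bf16"),
]:
    print(f"{name}: ~{weights_gb(params, fmt):.0f} GB, {verdict(params, fmt)}")
```

Long contexts or large batches need more KV cache than the default reserve, so treat "fits" as a starting point rather than a guarantee.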

Vision-Language

  • Qwen3-VL-32B / Qwen3-VL-30B-A3B / Qwen3-Omni-30B-A3B
  • InternVL3.5-38B Q6-Q8
  • Llama 3.2 90B Vision Q4 (~52 GB tight)
  • Pixtral 12B / Pixtral Large 124B Q2-Q3
  • Gemma 3 27B multimodal bf16
  • PaliGemma 2 28B bf16
  • Molmo 72B Q4 (~45 GB)
  • Aya Vision 32B bf16

Image generation

  • FLUX.1 [dev] / [schnell] fp16: batch-4 parallel (~10-15 seconds per 1024x1024 image at fp8 on Blackwell, published reference)
  • FLUX.1 Kontext [dev]: in-context editing across 4 tenants
  • SD 3.5 Large (18 GB fp16): 4 parallel generators
  • SDXL 1.0 + ControlNet + AnimateDiff stacks x 4
  • HunyuanImage-2.1 bf16 per-card
  • AuraFlow v0.3 / OmniGen v1 / Kolors 2.0 / PixArt-Sigma

Video generation

  • Wan 2.2 TI2V-5B bf16 on a single card: 4 parallel tenants
  • Wan 2.1 14B T2V/I2V Q4-Q6 per card
  • HunyuanVideo 13B Q4 (~30 GB) tensor-parallel 2-way
  • HunyuanVideo 1.5 (8.3B) bf16 per card
  • Open-Sora 2.0 (11B) Q8 per card: 4 parallel generations
  • CogVideoX-5B int8
  • Mochi-1 Q4 per card

Audio / Speech / TTS

Full Western and Chinese audio stack fits per card: Whisper v3 + Parakeet + Canary + Moshi + Step-Audio 2 / R1 + CosyVoice 3.0 + Kokoro + Stable Audio Open + MusicGen + AudioGen + SeamlessM4T v2. With 4 cards, each card can host a dedicated speech tenant. Whisper v3 turbo runs at ~50x realtime per card (published reference).

Multi-model / multi-tenant

The target use case. 16 GB per card rewards partitioned workloads:

  • Embedding cluster: BGE-M3 / Nomic / Jina-embed / E5 / Cohere Embed v3 — 4 tenants at high RPS
  • 7-13B serving at scale: 16-32 concurrent users per card via vLLM / SGLang; 64-128 concurrent total
  • Mixed pipeline: Card 1 = Qwen3-14B + reranker; Card 2 = Whisper + Moshi; Card 3 = FLUX.1; Card 4 = Wan 2.2 TI2V
  • 4-way tensor-parallel for 70B Q4 — Llama 3.3 70B AWQ INT4 across 4 cards, ~90-130 tok/s batch aggregate (extrapolated from gf-logic 4x4090 bench)
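The two deployment shapes above (one independent server per card, or one model sharded across all four) differ mainly in how the serving process is launched. The sketch below only builds the command lines and does not start anything. It assumes vLLM's `vllm serve` CLI; the port numbers, the 0.90 memory fraction and the AWQ repository name are placeholders, so check the flags against the installed vLLM version.

```python
import shlex


def per_card_commands(model: str, gpus: int = 4, base_port: int = 8000,
                      max_num_seqs: int = 32) -> list[str]:
    """One vLLM server per GPU, each pinned with CUDA_VISIBLE_DEVICES."""
    commands = []
    for gpu in range(gpus):
        args = [
            "vllm", "serve", model,
            "--port", str(base_port + gpu),
            "--max-num-seqs", str(max_num_seqs),   # concurrent sequences per card
            "--gpu-memory-utilization", "0.90",    # placeholder fraction
        ]
        commands.append(f"CUDA_VISIBLE_DEVICES={gpu} " + shlex.join(args))
    return commands


def tensor_parallel_command(model: str, gpus: int = 4, port: int = 8000) -> str:
    """A single vLLM server sharding one model across all GPUs."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(gpus),
        "--quantization", "awq",
        "--port", str(port),
    ]
    return shlex.join(args)


for line in per_card_commands("Qwen/Qwen3-14B"):
    print(line)

# Hypothetical repository name: substitute the AWQ checkpoint you actually use.
print(tensor_parallel_command("your-org/Llama-3.3-70B-Instruct-AWQ"))
```

With `max_num_seqs` between 16 and 32 per card, four per-card servers correspond to the 64-128 concurrent total quoted above.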

Target workloads

  • Budget multi-GPU AI serving platform for a startup or lab on a capex floor
  • Embedding + RAG infrastructure at 4-way horizontal scale
  • Image / video generation batch farm (Stable Diffusion / FLUX / Wan 2.2)
  • 7-13B small-model serving at scale — 4 independent tenants or 64-128 concurrent pooled
  • Development staging box for 70B Q4 tensor-parallel workflows

Published performance references

Kentino measured (4x4090 reference) + published 5080 estimates

Benchmark | Result
4x4090 reference: sustained fp16 | 647 TFLOPS
4x4090 reference: vLLM Llama 3.3 70B AWQ (batch-32) | 179.3 tok/s aggregate
4x4090 reference: llama.cpp 70B Q4_K_M (single) | 20.3 tok/s decode
5080 estimated: Llama 3.3 70B Q4 TP-4 single | ~15-20 tok/s
5080 estimated: FLUX.1 fp8 per card | ~2.2-2.8 s per 1024x1024 at 20 steps

Per card, the 5080 delivers roughly 1.35x the INT8 tensor throughput of the 4090. Single-stream decode is memory-bandwidth-bound, and there the two cards are at rough parity (5080 GDDR7 ~960 GB/s vs 4090 ~1 008 GB/s).
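The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: if every generated token has to stream the full set of weights from VRAM once, decode speed cannot exceed bandwidth divided by weight size. The sketch assumes a layer-split run in which the cards work one after another, so the effective bandwidth is that of a single card. It ignores KV-cache reads and PCIe hops, so real numbers land below the ceiling.

```python
def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream tokens/s when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 43.0  # Llama 3.3 70B Q4_K_M footprint quoted on this page

rtx5080 = decode_ceiling_tok_s(WEIGHTS_GB, 960.0)   # GDDR7 bandwidth quoted above
rtx4090 = decode_ceiling_tok_s(WEIGHTS_GB, 1008.0)  # 4090 bandwidth quoted above
ratio = rtx5080 / rtx4090

print(f"5080 ceiling: {rtx5080:.1f} tok/s")
print(f"4090 ceiling: {rtx4090:.1f} tok/s")
print(f"5080 / 4090:  {ratio:.2f}")
```

The measured 20.3 tok/s on the 4x4090 reference sits below its ceiling under this model, and the near-unity ratio is why the 5080 estimate stays in the same range.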

Not ideal for

  • 70B dense at Q6+ (16 GB-per-card limits per-card footprint; 64 GB pool is tight for Q6)
  • Long-context MoE flagships (Qwen3-235B, GLM-4.5): insufficient VRAM even at Q2
  • Single-stream latency-sensitive work on very large models (TP overhead eats into 16 GB cards)

Warranty and lead time

2 years parts warranty | 1 year labor warranty | 10-28 days lead time

Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order.

Recommended add-ons

  • Upgrade PSU to dual 2 kW ATX synced: raises headroom to about 56 %
  • NVIDIA ConnectX-5 100 GbE MCX555A-ECAT
  • Upgrade boot drive to 4 TB NVMe
  • Upgrade RAM to 384 GB (6x 64 GB) — better multi-model concurrent headroom
  • Rack PDU (C13/C19 metered) and 3 kVA online UPS