Setting Up an Inference Server: vLLM, llama.cpp, SGLang

The hardware arrives, the driver works, nvidia-smi shows every card. Now the question is what actually serves tokens. The answer is one of four or five open serving stacks, and the wrong choice will cost you 2× throughput, 3× latency, or three days of debugging. This article picks between them honestly and walks the setup for the one most Kentino customers should pick first: vLLM in Docker, with the NGC-validated image, on an OpenAI-compatible endpoint.

Audience: somebody who has read L02, can run docker run --gpus all nvidia-smi, and now needs the next layer up.

The decision matrix

Five stacks worth considering in May 2026. Everything else is either a wrapper around one of these or a research project.

Stack Best at Worst at When to pick
vLLM Single-model production serving, OpenAI-compatible API, throughput on Blackwell PCIe Multi-model heterogeneous workloads, MoE at extreme scale The default. 90% of Kentino installs.
SGLang Structured output (JSON), agent workflows, prefix-heavy multi-turn chat, large MoE Smallest deployments, niche model archs RAG, agents, JSON-out APIs, DeepSeek-V3 class.
llama.cpp Single-user, GGUF, mixed CPU/GPU, Jetson and Mac, small dev boxes Concurrent users at scale, FP8/Blackwell-native kernels Dev laptop, Jetson Orin, single-user appliance, one 5090 with no Linux fight.
TensorRT-LLM + Triton Multi-model serving, ensemble pipelines, lowest steady-state latency on H100/B200 Setup time, iteration speed, anything fast-moving Multi-model production over months. Heavy ops.
NVIDIA NIM Out-of-box, NVIDIA-QA'd, enterprise support Open weights not in catalog, ops want control Buying NVIDIA AI Enterprise, fastest time-to-running.

The opinionated short version: if you are not sure, run vLLM in the NGC container. If your workload is heavy on structured output or shared system prompts, run SGLang. If you are on a Jetson or a single 4090 dev box, run llama.cpp. Everything else is a corner case.

vLLM is the default — and the reason

vLLM (v0.20+ as of May 2026) is the most-deployed open serving stack for transformer LLMs and VLMs. Three things give it the lead:

  1. Continuous batching — incoming requests join an in-flight batch on the next forward pass. GPU utilization on a mixed-traffic endpoint goes from 40% to 85%+ versus naive batched HuggingFace inference.
  2. PagedAttention plus prefix caching — KV cache is managed in fixed-size blocks like OS virtual memory pages. Two requests sharing a system prompt share the KV blocks for that prefix. For agent workflows with 2 KB shared system prompts, prefix-cache hit rate runs 80–95%.
  3. Blackwell-first kernels — FlashAttention 3, FP8 attention, MXFP4 weight-only quant, and the CUTLASS-based matmul path target sm_120 (5090, RTX Pro 6000 Blackwell) and sm_100 (B200) natively.

CUDA 13 is the default for v0.20+ PyPI wheels and the vllm/vllm-openai:latest image. CUDA 12.8 wheels are still shipped for sm_120 fallback.

Installing vLLM: pip vs Docker

Two install paths. Pick Docker unless you have a specific reason not to.

Path A — pip in a venv (development only)

python3.12 -m venv ~/venvs/vllm && source ~/venvs/vllm/bin/activate
pip install --upgrade pip && pip install vllm   # CUDA 13.0 wheels by default

vllm serve meta-llama/Llama-3.3-70B-Instruct-FP8 --tensor-parallel-size 4 --port 8000

Pip works and is faster for iterating on engine flags. It also pins your Python interpreter, CUDA runtime, and driver-runtime compatibility to one host configuration. Use it for development; not for production.

Path B — Docker with the upstream image (production default)

docker run --gpus all --runtime nvidia \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --enable-prefix-caching

Notes that bite people who skip them:

  • --ipc=host is required for multi-GPU. vLLM workers communicate over shared memory; the default Docker IPC namespace is 64 MB and cannot hold the NCCL buffers.
  • Mount the HuggingFace cache. Otherwise every docker run re-downloads 75 GB of weights.
  • Pin a tag for production: vllm/vllm-openai:v0.20.2 not :latest.

NGC's nvcr.io/nvidia/vllm:25.09-py3 ships CUDA 13.0, a tested vLLM build, NCCL, and NVIDIA's QA stamp. Larger (~12 GB), a month or two behind upstream. Use it if you also run TensorRT-LLM or Triton on the same host and want one CUDA/NCCL stack across all of them. Otherwise the Docker Hub image is smaller, fresher, equivalent.

The flags that actually matter

vLLM exposes roughly two hundred CLI flags. Most are situational. The ones you will touch every time:

Flag What it does Sensible default
--tensor-parallel-size N Shard each layer across N GPUs in the node. 4 on a 4-GPU box, 1 if model fits.
--pipeline-parallel-size M Split layers across M stages. 1 unless model spans nodes.
--gpu-memory-utilization 0.92 Fraction of VRAM vLLM pre-allocates for weights + KV + activations. 0.90–0.92. Higher if no other tenant.
--max-model-len 8192 Maximum context. Caps KV cache budget per request. Set to what you actually serve. Lying upward burns memory.
--max-num-seqs 64 Max concurrent in-flight requests. 32–128. Tune with vllm bench serve.
--enable-prefix-caching Automatic prefix-cache reuse across requests. On. Free wins for shared system prompts.
--quantization fp8 / awq / gptq Tells vLLM the weight format. Often inferred from model name, but explicit is safer. Set it if the model card says so.
--swap-space 4 GiB of CPU RAM per GPU usable as paged-out KV. Default 0 = no offload. 4–8 if you preempt under load.
--port 8000 OpenAI-compatible endpoint. 8000 unless you collide.
--api-key sk-... Bearer-token auth. Set it. Or terminate auth at the proxy. Set one. Don't expose raw vLLM.

--gpu-memory-utilization is the most-tuned flag. vLLM uses it to decide how much VRAM to pre-allocate after loading weights; the leftover becomes the KV cache pool. Too low → premature KV preemption under load. Too high → OOM on the first long-context request. 0.92 is a reasonable starting point for a dedicated inference box. Drop to 0.85 if you share the GPU.

Three concrete launch commands

These are the configurations Kentino customers actually run. All assume CUDA 13.0, driver 570+, the Docker image, and --ipc=host.

Qwen 2.5 72B Instruct INT4 (AWQ) on 4× RTX Pro 6000 Blackwell

docker run --gpus all --runtime nvidia --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:v0.20.2 \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --max-num-seqs 64 \
  --port 8000

Roughly 36 GB of weights at INT4, 4× 96 GB cards. At TP=4 each card sees 1/4 of the KV per request, so 32 K context at 64 concurrent users sits well inside budget. Expect 40–60 tok/s per request at low concurrency, aggregate ~600–900 tok/s at 32 concurrent.

Llama 3.3 70B Instruct FP8 on 8× RTX 5090

docker run --gpus all --runtime nvidia --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:v0.20.2 \
  --model meta-llama/Llama-3.3-70B-Instruct-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --max-num-seqs 64 \
  --swap-space 4 \
  --port 8000

Note TP=4 × PP=2, not TP=8. 5090s have 32 GB each; FP8 weights for 70B are ~75 GB, comfortably under 128 GB on four cards. The pipeline split avoids the all-reduce blow-up TP=8 over PCIe would cost (see K03, N03). TP=8 on PCIe scales worse than TP=4 × PP=2 with continuous batching at concurrency above 8.

Qwen 2.5 VL 32B on dual 5090

docker run --gpus all --runtime nvidia --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.20.2 \
  --model Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --limit-mm-per-prompt '{"image": 4, "video": 0}' \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --port 8000

Two 5090s, ~20 GB INT4 weights, room for vision encoder + KV. The 72B VL variant does not fit on dual 5090 even at INT4; use 4× 5090 or a single Pro 6000 for that. --limit-mm-per-prompt caps images per request — a single request with 20 images can OOM the vision encoder on a 5090. The endpoint is OpenAI-compatible (POST /v1/chat/completions with image_url parts).

SGLang — when its router beats vLLM

SGLang's pitch is RadixAttention: prefix-cache reuse implemented as a radix tree across requests, not just block-equality matching. For workloads where most requests share long system prompts, the hit rate beats vLLM's prefix cache. Public benchmarks show SGLang at ~16k tok/s vs vLLM ~12k tok/s on H100 for shared-prefix workloads, with much bigger gaps on structured-output traffic.

Where SGLang wins:

  • Structured JSON via xGrammar. Faster and more compliant than vLLM's constrained decoding. For a JSON-out API at millions of requests, this matters.
  • DeepSeek-V3 and other large MoE. Wide expert-parallel and prefill/decode disaggregation shipped earlier; still ahead at multi-node scale.
  • Agent workflows with shared system prompts. RAG endpoints, copilots, 2–4 KB system prompts shared across most requests.

Where vLLM wins: batch processing with unique prompts (RadixAttention's edge collapses), model breadth (more architectures out of the box), and time to first running endpoint.

docker run --gpus all --runtime nvidia --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --tp 4 \
    --port 30000 \
    --host 0.0.0.0

SGLang exposes an OpenAI-compatible endpoint on its own port (30000 by default). Drop-in for the same client code, plus SGLang-native structured generation endpoints. The honest recommendation: try vLLM first. If your shared-prefix hit rate is above ~60% and you care about tail latency, retest with SGLang and pick whichever wins.

llama.cpp — when small and singular wins

llama.cpp is C++ inference with no Python runtime, no CUDA dependency, and a GGUF weight format that bundles quantization metadata into the file. It runs on CPU, single GPU, Mac M-series, Jetson, or partially on each. It does not do tensor parallel the way vLLM does. One model, one inference loop, fast.

When llama.cpp is the right answer:

  • Jetson Orin. vLLM does not target Jetson well; llama.cpp does. For an on-board model on a Unitree G1, this is the standard answer.
  • Single 5090 / 4090 dev box. One developer iterating, no concurrency, fastest install path.
  • Mixed CPU+GPU split. A 70B on a 5090 (32 GB) doesn't fit. With -ngl 40, 40 of 80 layers go to GPU, the rest run on CPU at single-user-acceptable speed.
  • Embedded appliance. No Docker, no driver mismatch headaches, no Python.

GGUF quantization choices, in order of size vs quality:

Quant Size vs FP16 Quality When to pick
Q2_K ~2.5 bits Visibly degraded Demo only.
Q3_K_M ~3.5 bits Noticeable degradation Smallest viable for usable output.
Q4_K_M ~4.5 bits Good to very good The default starting point.
Q5_K_M ~5.5 bits Very close to FP16 If you have the VRAM, take it.
Q6_K ~6.5 bits Indistinguishable from FP16 For quality-sensitive work.
Q8_0 ~8.5 bits Effectively lossless Maximum quality.
IQ4_XS / IQ3_M i-quants Better quality-per-bit at small sizes When fitting on tight VRAM.

Q4_K_M is the right default. Q5_K_M if you have the headroom. Q8_0 only to verify "no quant loss." The i-quants (IQ4_XS et al.) are a 2025/2026 evolution worth trying when squeezing a 70B onto a single 32 GB card.

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

./build/bin/llama-server \
  --model models/qwen2.5-72b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 32768 \
  --host 0.0.0.0 --port 8080 --api-key sk-localdev

-ngl 99 offloads as many layers as fit. The endpoint is OpenAI-compatible at :8080/v1/chat/completions. For multi-user load this is the wrong tool — use vLLM. For one user on one 5090 or a Jetson, it is the right one.

TensorRT-LLM and Triton — the heavy option

TensorRT-LLM is NVIDIA's optimizing compiler for LLM inference. It builds a model into a TensorRT engine ahead of time, fuses kernels, picks the best layout per GPU. Typically 10–25% throughput and 15–30% latency improvement over vLLM on the same hardware, more on H100/B200. The cost is a build step (minutes to hours per model + GPU config), engines that are not portable across CUDA / driver / GPU generations, and an operational story harder than docker run.

Triton Inference Server is the multi-model serving layer that hosts TensorRT-LLM engines (plus PyTorch, ONNX, TensorFlow, Python, vLLM-as-backend) in one server. The model repository pattern lets you load LLM A, LLM B, a vision model, an ASR model, and a Python pipeline behind one URL with versioning, A/B routing, and ensemble graphs.

Worth the cost when: serving multiple heterogeneous models on one server; months of steady-state production with stable model choice; squeezing the last bit of p99 latency on Hopper/Blackwell datacenter GPUs.

Not worth it when: you are still picking your model (every swap is a new engine build); you are on consumer/workstation GPUs (the TRT-LLM speedup shrinks vs H100, and vLLM ships next week's model six weeks before TRT-LLM does); you are a small team without ops bandwidth.

NVIDIA NIM — the prebuilt path

NIM (NVIDIA Inference Microservices) bundles "Triton + TensorRT-LLM + an optimal config for this specific model + an enterprise license" into one container. docker pull nvcr.io/nim/meta/llama-3.3-70b-instruct:latest, set an NGC key, run, and you have a tested OpenAI-compatible endpoint without picking quant or tuning flags.

Fits when: you bought NVIDIA AI Enterprise (or a customer requires it); you want fastest time to a known-good endpoint (hours, not days); the model you want is in the catalog (Llama, Mistral, Mixtral, Gemma, Nemotron and a growing partner set). Post-GTC 2026 a free tier covers up to 16 GPUs for Developer Program members — enough to evaluate on most Kentino single-server installs; verify current licensing terms before betting on it.

Does not fit when: the model is not in the catalog (open weights arrive with weeks of lag); you want to tune (NIM hides most knobs by design).

Reverse proxy, auth, rate limiting

Do not expose vLLM directly on the public internet. Put nginx (or Caddy, or Traefik) in front of it. A minimal nginx config:

upstream vllm_backend {
    server 127.0.0.1:8000;
    keepalive 32;
}

limit_req_zone $binary_remote_addr zone=vllm:10m rate=20r/s;

server {
    listen 443 ssl http2;
    server_name infer.example.com;

    ssl_certificate     /etc/letsencrypt/live/infer.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/infer.example.com/privkey.pem;

    location /v1/ {
        if ($http_authorization != "Bearer sk-yourlongrandomtoken") {
            return 401;
        }
        limit_req zone=vllm burst=40 nodelay;

        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;             # streaming SSE
        proxy_read_timeout 600s;
        chunked_transfer_encoding on;
    }
}

Three points people get wrong:

  • proxy_buffering off is required for streaming responses. Otherwise tokens accumulate in the proxy and the client sees a one-shot delivery.
  • proxy_read_timeout defaults to 60s. Long multi-modal prefills at 4 tok/s blow past that. Set it to 5–10 minutes.
  • The if block is exact-match. For real auth, use Lua, oauth2-proxy, or a gateway like Kong.

Monitoring

Two metric sources, two scrape targets.

  • vLLM's /metrics endpoint. Prometheus-compatible, same port as the OpenAI API (:8000/metrics). Publishes vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc, vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds. KV cache utilization is the one that catches preemption before it kills latency.
  • DCGM exporter (nvcr.io/nvidia/k8s/dcgm-exporter, port 9400). Exports GPU SM util, memory util, power, temperature, PCIe BW, ECC counters.

Both into Prometheus, both into Grafana. First dashboard: requests-running, KV-cache-usage, GPU-util, GPU-temp, P95 TTFT, P95 TPOT. If KV-cache-usage saturates while requests-waiting climbs, you are preempting — drop --max-num-seqs, raise --gpu-memory-utilization, or add a replica.

The model warmup gotcha

The first request to a freshly-started vLLM is 5–30× slower than steady-state. CUDA graphs are captured for each batch shape, the prefix cache is empty, the scheduler is calibrating. Fire a small /v1/chat/completions request on startup before joining the load balancer; without it, the first real user gets a 15-second response on a model that normally answers in 1. For blue/green deploys, warm the new replica before draining the old. Skipping warmup is the single most common cause of "the deploy looked fine but users complained for ten minutes."

Multi-model on one server

vLLM is fundamentally a one-model server. Three options if you need multiple:

  1. One container per model, different ports. Each holds its own VRAM. A 70B and a 7B share a 4-GPU box via CUDA_VISIBLE_DEVICES slicing plus matching --gpu-memory-utilization. Memory budget is your job to plan.
  2. Load on demand. A front-end loads/unloads as requests arrive. Load takes 30–120 s for a 70B; first-hit latency is unacceptable for interactive use. Fine for batch.
  3. Triton's model repository. Triton owns the lifecycle of N models in one server, routes by model name. What you graduate to when option 1 stops scaling.

For most Kentino installs, option 1 is right until you have four or more models, at which point Triton's operational tax is worth paying.

The honest take

Ninety-five percent of Kentino customers should start with vllm/vllm-openai in Docker, behind nginx, on one model, with Prometheus + DCGM exporter scraping, and not look at anything else for the first three months. SGLang earns its place for structured output or shared-prefix agent traffic at scale. llama.cpp earns its place on a Jetson, a Mac, or a single-user dev box. Triton and TensorRT-LLM earn their place when you have months of stable production with multiple models. NIM earns its place when the model is in the catalog and the license is in hand.

The cost of starting too complex is real. We have seen lab installs spend three weeks getting Triton + TensorRT-LLM to serve a single Llama 70B that vLLM would have hosted in twenty minutes. Pick the simple thing first. Add complexity when you have evidence you need it.

What to do next

For a new K-AI server owner, here is the five-step path:

  1. Verify the floor. docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi should list every GPU. If it doesn't, fix that first (see L02).
  2. Pull the vLLM image and launch one model. Pick one of the three recipes above. Bind to localhost. Hit /v1/models and /v1/chat/completions from curl on the host.
  3. Put nginx in front. TLS via Let's Encrypt, bearer-token auth, rate limit, proxy_buffering off. Verify streaming works from a remote client.
  4. Wire up Prometheus + DCGM exporter + Grafana. Build the four-panel dashboard: KV-cache-usage, requests-running, GPU-util, P95 time-per-output-token. Set an alert on KV-cache > 95% sustained.
  5. Run a load test. vllm bench serve against your endpoint with realistic prompt shapes and concurrency. Tune --max-num-seqs, --gpu-memory-utilization, and --max-model-len until P95 latency and aggregate throughput hit your SLA.

Follow-ups in this track: network topology (I03), power and cooling for a robotics lab (I04), the reference build (I05), and fleet deployment (I06). The clustering math sits in K03, the interconnect reality in N03.


This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.