Edge AI Architecture: How a Robot Talks to an On-Prem Inference Server
If you are buying a humanoid or quadruped robot in 2026, the unit you receive is only half the system. The other half is the compute that runs the large models the robot itself cannot host. This article explains what actually lives where, why the on-prem path exists at all when cloud APIs are cheap, and how the two halves are physically and logically wired together.
The audience is buyers and integrators who want a clear picture before signing a purchase order — not yet engineers writing the glue code.
The two-box problem
A modern humanoid like the Unitree G1 or Booster T1 carries an embedded compute module on board: typically a Jetson Orin (NX or AGX class) or a Snapdragon-class SoC, often paired with a small dedicated MCU for the motor control loop. Quadrupeds like the Unitree Go2 use a similar layout at a smaller scale.
That on-board compute is enough for:
- Real-time motor control — joint torque, balance, gait. This runs at 500 Hz to 1 kHz and cannot tolerate any network hop. It stays local, period.
- Immediate safety reflexes — fall-protection, emergency stop, contact response. Same constraint, same answer.
- Low-latency perception — depth from stereo or RGB-D, IMU fusion, basic object detection (typically a small YOLO or MobileNet variant).
- Voice wake-word and basic command routing.
What the on-board compute cannot reasonably run:
- Large vision-language models (VLMs) like Qwen2.5-VL 72B, NVIDIA Cosmos, OpenVLA. These need tens of GB of VRAM and are too power-hungry for a battery-powered platform.
- Large language models (LLMs) for dialogue or planning above the 7B–8B class quantized.
- Diffusion-based or autoregressive motion planners that operate on hundreds of milliseconds of context.
- Scene memory and long-horizon task graphs.
This split is not a Kentino opinion. It is the actual partition every credible humanoid platform ships with today. The on-board module is sized for safety and latency; the heavy thinking goes off-board.
The "off-board" target can be the cloud (OpenAI, Anthropic, NVIDIA Cosmos hosted) or an on-prem inference server. The rest of this article is about when and why you want the on-prem option.
What lives where — the practical split
- 500 Hz–1 kHz control loop
- Local safety reflexes
- Stereo / RGB-D depth
- Lightweight YOLO / MobileNet
- IMU + sensor fusion
- Wake-word, command routing
- Local microphone, speaker
- Battery management
- VLM — Qwen2.5-VL, Cosmos, OpenVLA
- LLM dialogue & planning (70B+)
- Motion / trajectory planner
- Scene memory / RAG
- Training — LoRA, RLHF
- Simulation — Isaac Sim
The two-box split: local control stays on the robot; heavy inference and training go to the on-prem server over LAN.
Three things to notice:
- The robot does not need a GPU server to walk. If the LAN goes down, the unit still stands up, holds balance, and avoids hitting the wall. The off-board pipeline adds capability — language, planning, long-horizon tasks — it does not replace local control.
- Some workloads are negotiable. A small VLM (e.g. Qwen2.5-VL 7B quantized) can run on a Jetson Orin AGX. Whether you push it on-board or off-board is a power, thermal, and latency trade. There is no one right answer.
- The server is doing more than serving inference. A serious deployment also runs training (for fine-tuning to a specific environment), simulation (Isaac Sim for policy iteration), and scene memory / retrieval. The "inference server" is shorthand for "GPU compute backend."
The latency budget
The single number that determines whether your architecture is going to work is the round-trip time from the robot to the server and back.
| Hop | Typical latency |
|---|---|
| On-board YOLO inference (Jetson Orin AGX) | 8–15 ms |
| LAN (wired 2.5/10 GbE) round-trip | 0.2–0.5 ms |
| Wi-Fi 6 round-trip (good conditions) | 3–10 ms (jittery) |
| Server-side VLM inference (Qwen2.5-VL 72B) | 200–800 ms TTFT |
| Server-side LLM token generation (70B Q4) | 30–80 ms/token |
| WAN to cloud (within EU) | 15–40 ms RTT |
| WAN to cloud (transatlantic) | 80–120 ms RTT |
The numbers that dominate the budget are server-side model latency and (if you are wireless) Wi-Fi jitter. The transport itself is rarely the bottleneck on a wired LAN. That is why most serious robot installs use a wired tether or a dedicated Wi-Fi 6/6E AP within line-of-sight of the work area.
If your robot must answer a verbal command in under one second, the math is roughly:
audio capture ~ 50 ms
wake-word + STT 100–250 ms (on-board or off-board, your choice)
LLM TTFT (planning) 200–500 ms (server-side)
LLM stream 20 tokens 600–1200 ms
TTS 100–300 ms
audio playout ~ 50 ms
--------------
~ 1.1 – 2.3 s
There is no way to hit "under one second end-to-end" with a 70B LLM in the loop today. You either accept the latency, use a smaller model on-prem (8B–13B class), or split the response (acknowledge immediately, plan in background).
This is the kind of trade-off that has nothing to do with the network and everything to do with model choice. It is also the kind of trade-off where having your own server matters: you can swap models, run multiple at once, batch differently. With a cloud API, you take what the provider serves.
Why on-prem at all
Cloud inference is cheap, scales effortlessly, and requires zero capex. So why would anyone buy a four-or-eight-GPU server when an OpenAI API key is fifty dollars?
There are four real reasons, in rough order of how often they actually drive the decision:
1. Data does not leave the building. Industrial, defense, healthcare, and EU/GDPR-sensitive deployments often cannot ship raw sensor data to a third-party cloud. The robot sees a factory floor, a patient room, a lab bench. That video is either privileged, regulated, or competitive. On-prem inference solves this cleanly. Cloud does not.
2. Latency floor. Every cloud call has a 15–40 ms WAN RTT on top of model latency. For most robotics tasks this is fine. For closed-loop reactive control — pick-and-place at speed, balance recovery after a push, fine manipulation — the WAN floor is too much. On-prem brings it under 1 ms.
3. Cost at sustained load. A single 70B VLM call costs the cloud provider almost nothing; they charge you a few cents. But a robot that is "always thinking" makes thousands of calls per hour. A research lab running training and a fleet of robots both inferring against the same models can saturate $5,000–$15,000/month in cloud spend. An on-prem 4× or 8× GPU server pays back in under twelve months at that load.
4. Model choice and customization. You want to fine-tune a VLM on your specific environment. You want to run a model the cloud providers do not host. You want to mix open-weight models from different vendors in a custom pipeline. On-prem gives you that. Cloud APIs do not.
If none of those four matter to your deployment, you should use the cloud. The honest answer is that perhaps 30–40% of robotics buyers actually need on-prem; the rest are choosing it for prestige or because their procurement office is uncomfortable with cloud. Both are valid reasons too, but we are not going to pretend otherwise.
Network topology
1 AP / ~50 m²
wired 10 GbE
Network topology: robot on Wi-Fi 6E → dedicated AP → 10 GbE switch → inference server. Egress and WAN are optional.
A few practical notes that catch people by surprise:
- Wi-Fi 6E is the floor, not the ceiling. 6 GHz band is the only one with consistent low-latency behavior in environments with other devices. Plan one AP per ~50 m² of robot working area, line-of-sight if possible.
- Wired is still better. If the robot has a tether option (some research-tier humanoids do, quadrupeds usually do not), use it during development. The development experience with sub-millisecond LAN is dramatically nicer than fighting Wi-Fi.
- Egress is optional but useful. Even on a fully on-prem deployment, you usually want WAN egress for: updating models, fetching map data, pulling NVIDIA NGC containers, telemetry to your own monitoring. Firewall it tightly; do not expose the robot or the server to the internet.
- Time sync matters. Run a local NTP server. Sensor fusion and motion planning break in subtle ways when clocks drift across the robot, server, and any auxiliary edge devices.
Power and cooling
A 4-GPU K-AI server (4× RTX 5090 or 4× RTX Pro 6000 Blackwell) draws 1.8–2.4 kW sustained under load. An 8-GPU server draws 3.5–4.5 kW. These are not desktop numbers; they need 16 A circuits and proper rack airflow.
For a robotics lab the rough budget is:
| Item | Sustained power |
|---|---|
| Humanoid robot (charging dock) | 0.5–1.5 kW |
| Quadruped robot (charging dock) | 0.2–0.5 kW |
| 4-GPU K-AI server | 1.8–2.4 kW |
| 8-GPU K-AI server | 3.5–4.5 kW |
| Network gear (switch, APs) | 50–150 W |
| Cooling (room AC overhead for the above) | ~30% of total |
Plan the room before the install, not after. Most labs that "added GPU compute later" end up tripping breakers in the first month.
Airflow matters as much as wattage. The K-AI servers use industrial front-to-back rack airflow with 120 mm fans and are explicitly designed for 24/7 sustained load. They are not desktop tower builds, and they will saturate a small room's HVAC. Either rack them in a closet with dedicated cooling, or put them in a separate room and run a 10 GbE link.
Software stack — what actually runs
This is where the implementation lives. The high-level picture:
Ubuntu 22.04 + ROS 2 Humble
- ROS 2 nodes: perception, control, command router
- Local lightweight models (YOLO, MobileNet, wake-word)
- gRPC client to the inference server
- Manufacturer SDK (Unitree, Booster, EngineAI)
Ubuntu 22.04 + CUDA 12.x / 13.x
- Inference: vLLM, SGLang, llama.cpp, or NVIDIA Triton
- VLM / LLM model weights (HuggingFace, NGC, internal)
- Scene memory store: ChromaDB or pgvector
- Optional training: PyTorch, accelerate, peft, Isaac Sim
- Monitoring: Prometheus, Grafana (latency, GPU util, queue depth)
Software split: ROS 2 on the robot communicates over gRPC to the inference server running vLLM / SGLang.
A couple of opinionated calls:
- vLLM is the default serving stack for transformer-based VLMs and LLMs. It is faster than naive HuggingFace inference and supports continuous batching and prefix caching. SGLang is a strong alternative if you are doing structured output or agent-style workflows.
- llama.cpp is the right answer when you are running a small model (7B–13B class) on a GPU that vLLM does not love (e.g. RTX 4090 with quirky tensor parallelism), or on the robot itself.
- NVIDIA Triton is heavier to set up but is the right call if you are mixing model types (LLM + vision + speech) and want one serving layer over all of them.
- ROS 2 Humble is the lingua franca. Manufacturer SDKs (Unitree, Booster, EngineAI) ship ROS 2 wrappers. Build your integration on the ROS 2 side, not the manufacturer's proprietary protocol, unless you have a specific reason.
A concrete example
Imagine the simplest viable production setup: one humanoid (Unitree G1), one inference server (K-AI 256 Turin Dual with 8× RTX 5090), one switch, one AP, in a 30 m² lab.
Hardware:
- Unitree G1 (one unit, ~1 kW charging draw)
- K-AI 256 Turin Dual / 8× RTX 5090 (sustained ~4 kW)
- 10 GbE switch (5 ports, ~30 W)
- Wi-Fi 6E AP (~15 W)
- 32 A three-phase or 2× 16 A single-phase circuit
- ~6 kW dedicated cooling (split AC)
Software on server:
- Ubuntu 22.04, CUDA 13, Docker
- vLLM serving Qwen2.5-VL 72B at INT4
- vLLM serving Qwen2.5 32B (text-only, for planning)
- pgvector for scene memory
- Isaac Sim for policy work
- Prometheus + Grafana
Software on robot:
- Unitree's ROS 2 driver
- Custom command-router ROS 2 node
- gRPC client to vLLM endpoints
- Local YOLO for fast obstacle detection
This is a real, viable architecture you can buy and stand up in two to three weeks. The total budget for the compute side (server, networking, electrical, cooling delta) lands in the €60k–€90k range depending on configuration. The robot is its own line item.
For a research lab with 2–4 robots, the same server scales — vLLM handles concurrent requests fine, and the bottleneck becomes either GPU memory (if you want to host more models simultaneously) or wall power.
What breaks
Honest list of failure modes we have seen, in rough order of how often they bite:
- Wi-Fi jitter under load. A previously-fine network gets congested when a new device joins, latency spikes from 5 ms to 80 ms, and the robot's reactive behavior degrades. Fix: dedicated SSID for robotics, 6 GHz only, line-of-sight AP.
- Model swap takes the system down. You update the VLM, vLLM has to reload, the robot's command pipeline times out. Fix: blue/green serving, two endpoints, swap one at a time.
- Server thermal throttle. Under sustained training + inference, the room AC cannot keep up, and the GPUs throttle. Inference latency doubles silently. Fix: size cooling for 1.3× sustained, monitor GPU temps, alarm at 80 °C.
- Cable / connector failures on the robot side. Robots vibrate. Wires fatigue. Plan for one network or sensor cable failure per robot per quarter on heavy-use units. Keep spares.
-
NVIDIA driver / CUDA version mismatch after an apt-get upgrade. This bites everyone exactly once. Pin your driver version, use containers, do not
apt-get dist-upgradeon a working server. - Clock drift between robot and server. Sensor timestamps drift by tens of milliseconds, sensor fusion produces garbage, no one understands why. Fix: local NTP, monitor it, alarm on drift > 5 ms.
None of these are catastrophic. All of them are predictable once you have seen them once.
When the on-prem path is the wrong answer
To stay honest: the on-prem inference server is the wrong answer when —
- Your robot is purely a teleoperated unit and your operator is sitting at a cloud workstation anyway. Just use the cloud.
- You are deploying one quadruped for an outdoor inspection task and there is no need for large model inference; the on-board compute does the job.
- Your model footprint actually fits on the Jetson Orin AGX (50–70 W TDP) and your latency budget allows on-board. The Orin can run a 7B INT4 LLM at usable speed.
- You are running a one-off demo. Cloud is faster to set up, cheaper at zero load, and you can tear it down in an afternoon.
The on-prem path is right when the deployment is sustained, the data is sensitive, the latency budget is tight, or the customization needs are real. Most serious robotics work in 2026 hits at least one of those.
What to do next
If you are evaluating an on-prem build, the questions worth answering before spending money are:
- What models do you need to host? Write down names and parameter counts. This sizes the GPU memory.
- How many concurrent inference requests, peak and sustained? This sizes the GPU count.
- Do you need training capacity, or only inference? Training roughly triples the GPU and storage budget.
- What is your power and cooling envelope? Be honest. Most labs find out the hard way that they cannot deliver 5 kW continuous.
- Do you have an integrator? Manufacturer SDKs are good but the glue between the robot, the inference server, and your application is real work. Either staff for it or hire it.
The follow-up articles in this series will go deeper on specific pieces: model serving (I02), network topology (I03), the actual reference build with parts list and benchmarks (I05), and fleet deployment (I06).
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.