On-Device vs Off-Device Compute for Robots
Every workload on a robot lives in one of two places: on-board, riding the battery and the chest fans, or off-board, on a wall-powered server one LAN hop away. The hard question is not whether you need both — you almost always do — but which workload goes where, and why. R08 makes the case that a dedicated edge tier exists at all. This article is the per-workload decision: motor control here, VLM there, STT depends, and so on.
The frame is opinionated. There is a right answer per workload, and the right answer changes as on-board silicon catches up. We will say which way each piece is moving in 2026–2027.
The two budgets that decide everything
A workload can run on-board if and only if it fits in two budgets simultaneously: power and memory. Latency is the third constraint, but latency usually tells you which way to push a workload that fits, not whether it fits at all.
Power budget on a humanoid. A 30–50 kg humanoid carries 700–900 Wh of battery. Sustained walking draws 200–500 W from the actuators, with peaks of 800 W or more during dynamic motion. Standing still and thinking still draws 80–150 W from the actuators (holding posture is not free). That leaves roughly 60–120 W for everything else (compute, sensors, cooling fans, radio) if you want a usable two-hour runtime. The compute budget inside that envelope is 30–80 W sustained. Pinning the Jetson at MAXN (60 W) around the clock spends most of that compute budget on one module and visibly shortens runtime, most of all while the robot is standing and actuator draw is low. This is physics, not a design choice.
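The arithmetic behind that envelope can be sketched in a few lines. This is a rough estimate only: it assumes the whole rated battery capacity is usable and ignores converter losses, and the midpoint values are picked here for illustration, not taken from any robot's datasheet.

```python
def runtime_hours(battery_wh: float, actuator_w: float, other_w: float) -> float:
    """Estimated runtime in hours for a constant total draw.

    Assumes the full rated capacity is usable and ignores conversion
    losses, so real runtimes will be shorter.
    """
    total_w = actuator_w + other_w
    if total_w <= 0:
        raise ValueError("total draw must be positive")
    return battery_wh / total_w


# Ranges come from the paragraph above; the midpoints are illustrative choices.
BATTERY_WH = 800.0   # middle of 700-900 Wh
WALKING_W = 350.0    # middle of 200-500 W sustained walking

if __name__ == "__main__":
    for other_w in (60.0, 120.0):  # both ends of the non-actuator envelope
        hours = runtime_hours(BATTERY_WH, WALKING_W, other_w)
        print(f"walking, {other_w:.0f} W non-actuator load: {hours:.2f} h")
```

Swapping in the standing-still actuator range shows how much more the compute share matters when the legs are idle.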
Memory budget. A Jetson AGX Orin 64 GB is 64 GB of LPDDR5 shared between the CPU and the GPU. Subtract the OS (4–6 GB), ROS 2 and the manufacturer SDK (2–4 GB), perception buffers (2–4 GB for stereo depth and point clouds), and any application code. The realistic LLM/VLM-usable budget is 40–48 GB. On a Jetson Thor 128 GB module the ceiling rises to ~96 GB usable. On Orin NX 16 GB it collapses to ~8–10 GB usable — enough for one small model.
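The same subtraction can be written down as a small fit check. This is a sketch under stated assumptions: the `app_gb` figure (application code plus headroom) is a placeholder chosen here, and the model sizes passed to `fits` are whatever you measure or estimate for your own weights and KV cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryBudget:
    total_gb: float
    os_gb: float
    middleware_gb: float   # ROS 2 plus the manufacturer SDK
    perception_gb: float   # stereo depth and point-cloud buffers
    app_gb: float          # application code and headroom (assumed value)

    def usable_gb(self) -> float:
        """Memory left for LLM/VLM weights and KV cache."""
        reserved = self.os_gb + self.middleware_gb + self.perception_gb + self.app_gb
        return self.total_gb - reserved


def fits(budget: MemoryBudget, weights_gb: float, kv_cache_gb: float) -> bool:
    """True if weights plus KV cache fit in the usable budget."""
    return weights_gb + kv_cache_gb <= budget.usable_gb()


# AGX Orin 64 GB using the upper ends of the ranges above; app_gb is a guess.
AGX_ORIN = MemoryBudget(total_gb=64, os_gb=6, middleware_gb=4, perception_gb=4, app_gb=6)

if __name__ == "__main__":
    print(f"usable: {AGX_ORIN.usable_gb():.0f} GB")
    print("45 GB weights + 10 GB KV fits:", fits(AGX_ORIN, 45, 10))
```

With these inputs the usable figure lands inside the 40–48 GB range quoted above; shared memory means every buffer you add elsewhere comes straight out of the model budget.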
Off-board, both budgets are different problems. A K-AI 96 (4× RTX 5090) has 128 GB of dedicated VRAM and 1.0–1.5 kW of GPU power under wall power. A K-AI 256 (8× RTX 5090) doubles both. Power and memory simply stop being constraints — you trade them for capex, room HVAC, and wired networking instead.
On-board silicon, 2026 reality
The on-board options that actually ship in serious humanoids and quadrupeds today:
| SoC | AI throughput (INT8 TOPS, sparse, unless noted) | Usable memory | Power envelope | What it can run |
|---|---|---|---|---|
| Jetson Orin NX 16 GB | 100 | ~10 GB shared | 10–25 W | YOLO, small VLM 3B Q4, wake-word |
| Jetson AGX Orin 64 GB | 275 | ~40 GB shared | 15–60 W | YOLO, VLM 7B Q4 at usable rates, 13B LLM |
| Jetson AGX Thor 128 GB (2026) | 2070 FP4 TFLOPS | ~96 GB shared | 40–130 W | VLM 32B Q4 with room to spare, dual-stream perception |
| Snapdragon 8 Gen 3 / QRB-class | 45–75 | ~6–10 GB shared | 5–15 W | Voice, wake-word, light CV |
| Hailo-8 (M.2 add-in) | 26 INT8 TOPS | 2–4 GB on-chip | 2.5 W typical | Vision pipeline offload, MobileNet/YOLO |
| Hailo-15 (vision SoC) | 20 TOPS | Pipeline-resident | 2–5 W | Always-on multi-camera CV |
| Intel N97 / i7-1370P co-processor | n/a (CPU + iGPU) | DDR5 host | 6–45 W | High-level orchestration, ROS 2, glue |
Thor is a step change, not an iteration. The 128 GB memory ceiling and the FP4-native Blackwell tensor cores move the on-board frontier from "7B VLM, painfully" to "32B VLM, usefully" in one generation. Workloads that were structurally off-board in 2025 — moderate-sized VLMs, continuous scene captioning, mid-size LLM planners — become on-board candidates in 2026 if the robot's thermal solution can sustain 100–130 W.
Hailo and the dedicated NPU class are not LLM accelerators. They are vision-pipeline offload: keep the main Jetson free for the bigger model, run the always-on camera CV (object detect, tracking, basic VLM-lite) on a 3 W chip.
Snapdragon QRB-class is the answer for voice-first robots — wake-word, beamforming, STT — where a Jetson is overkill and a Cortex-M is undersized. Useful, narrow, increasingly common.
Off-board: the K-AI tier as reference
The off-board target referenced throughout this series is the K-AI line — wall-powered, 4U or 5U rack, 4 or 8 GPUs on EPYC or Xeon, on a 10 GbE switch one hop from the robot. The three tiers that matter for robotics:
| Tier | GPUs | Aggregate VRAM | Sustained power | Largest realistic VLM (INT4) |
|---|---|---|---|---|
| K-AI 64 / 96 (4-GPU) | 4× RTX 5090 or 4× Pro 6000 | 128–384 GB | 1.8–2.4 kW | 72B with room for KV cache |
| K-AI 128 (4× Pro 6000 Blackwell) | 4× RTX Pro 6000 96 GB | 384 GB | 2.0–2.6 kW | 72B with full context |
| K-AI 256 (8-GPU) | 8× RTX 5090 or 8× Pro 6000 | 256–768 GB | 3.5–4.5 kW | 70B + 32B + 7B simultaneously |
The decision matrix
Every robotics workload sorts cleanly into one of five categories: always on-board, on-board preferred, either, off-board preferred, off-board only. The reason for each placement is the binding budget — power, memory, or latency.
| Workload | Placement | Binding constraint | Why |
|---|---|---|---|
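| Motor control loop (500 Hz–1 kHz) | Always on-board | Latency (< 1 ms) | The loop period is 1–2 ms. A wireless round trip is 3–40× that at the median and far worse at P99; even a sub-millisecond wired hop leaves no margin for jitter. No exceptions. |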
| Joint safety reflexes / e-stop | Always on-board | Latency (< 5 ms) | Same reason; also must work when LAN is down. |
| IMU sensor fusion | Always on-board | Latency (< 5 ms) | Feeds the control loop directly. |
| Stereo / RGB-D depth | Always on-board | Bandwidth + latency | Raw camera bandwidth is too high to ship; depth output is small but needed fast. |
| YOLO-class object detection | On-board preferred | Latency (10–30 ms reflexive) | Fits a Jetson at 30+ FPS. No reason to network it. |
| Wake-word detection | On-board preferred | Power + latency | Always-on, sub-watt on QRB or Hailo. Off-board burns network bandwidth 24/7. |
| Whisper-class STT (small) | Either | Latency tolerance | 100–250 ms on Jetson AGX Orin. Off-board is 30–80 ms on a 5090. User can't tell. |
| Small LLM (≤ 8B Q4) for dialogue | Either | Latency vs concurrency | 15–25 tok/s on AGX Orin, 80–150 tok/s on a 5090. Off-board wins if you need to serve several robots or sessions at once. |
| VLM 3B Q4 (scene captioning) | On-board preferred | Power + latency | Fits Orin easily, no network needed, ~10 FPS usable. |
| VLM 7B Q4 (Qwen2.5-VL, OpenVLA) | Either | Power vs frame rate | AGX Orin runs 5–8 FPS at 30 W. K-AI runs 30+ FPS. Pick per task. |
| VLM 32B Q4 | Off-board preferred (Thor: marginal on-board) | Memory + power | Doesn't fit AGX Orin. On Thor 128 GB it fits in memory but needs a thermal solution that can sustain 100–130 W. Comfortable on K-AI 96. |
| VLM 70B+ Q4 | Off-board only | Memory | 45–50 GB weights + 10–20 GB KV. No on-board module hosts this in 2026. |
| Motion planner (short horizon, < 1 s) | On-board preferred | Latency | Tight loop with control. Local is the right answer. |
| Motion planner (long horizon, multi-s) | Off-board preferred | Memory + model size | VLM or diffusion-based, large context, off-board wins. |
| Scene memory / RAG | Off-board only | Persistence + memory | Must survive robot reboot; vector store wants real storage and CPU RAM. |
| Multi-camera VLM fusion (3–5 streams) | Off-board only | Memory + compute | Batching across cameras needs a multi-GPU server. |
| Fine-tuning / LoRA training | Off-board only | Memory + power | 2–5× inference memory, sustained kW-class power. Not happening on a battery. |
| Full pre-training | Off-board only | Doesn't even fit one server | Multi-node territory. See the K track. |
| Isaac Sim policy iteration | Off-board only | GPU rendering + RL throughput | Inherently a server workload. |
Three readings of this table worth pulling out.
The "always on-board" rows are physics. No amount of bandwidth fixes a control loop that needs 1 kHz response. These never move, however good the network or the off-board compute gets.
The "either" rows are where actual engineering judgement happens. Most of the interesting trade-offs live here. Whether a 7B VLM runs on-board or off-board determines a real chunk of the system's behaviour. There is no global right answer.
The "off-board only" rows move slowly. A 72B model will probably still be off-board only in 2027. A 32B model will move on-board as Thor ships in volume. The frontier of "what fits on-device" advances roughly one model size class every 18–24 months.
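As a sketch of how the matrix can drive a first-pass plan, the snippet below encodes the five categories and a handful of rows, then sorts a workload list into "on-board", "off-board", and "decide" buckets. The bucket names and the `first_pass` helper are illustrative inventions, and only a few rows are copied in.

```python
from enum import Enum


class Placement(Enum):
    ALWAYS_ON_BOARD = "always on-board"
    ON_BOARD_PREFERRED = "on-board preferred"
    EITHER = "either"
    OFF_BOARD_PREFERRED = "off-board preferred"
    OFF_BOARD_ONLY = "off-board only"


# A few rows copied from the matrix above; extend with the rest as needed.
MATRIX = {
    "motor control loop": Placement.ALWAYS_ON_BOARD,
    "yolo-class object detection": Placement.ON_BOARD_PREFERRED,
    "vlm 7b q4": Placement.EITHER,
    "vlm 32b q4": Placement.OFF_BOARD_PREFERRED,
    "scene memory / rag": Placement.OFF_BOARD_ONLY,
}

ON_BOARD = (Placement.ALWAYS_ON_BOARD, Placement.ON_BOARD_PREFERRED)


def first_pass(workloads: list[str]) -> dict[str, list[str]]:
    """Sort workloads into buckets; unknown names raise KeyError."""
    plan: dict[str, list[str]] = {"on-board": [], "decide": [], "off-board": []}
    for name in workloads:
        placement = MATRIX[name.lower()]
        if placement in ON_BOARD:
            plan["on-board"].append(name)
        elif placement is Placement.EITHER:
            plan["decide"].append(name)
        else:
            plan["off-board"].append(name)
    return plan


if __name__ == "__main__":
    print(first_pass(["Motor control loop", "VLM 7B Q4", "Scene memory / RAG"]))
```

The "decide" bucket is the useful output: it is the short list of workloads that need a per-deployment judgement call.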
Latency tier mapping
The four-tier latency budget from I01 maps to placement directly:
| Tier | Budget | Placement |
|---|---|---|
| Reactive control | < 10 ms | On-board only. Period. |
| Reflexive perception | 10–50 ms | On-board by default (a Wi-Fi round trip can eat most of the budget; a wired LAN hop can fit) |
| Deliberative planning | 100 ms – 1 s | Either. LAN works, on-board works. |
| Strategic reasoning | 1 s – multi-s | Either, off-board often wins on model quality. |
The two middle tiers are where the real architecture decisions sit. A reflexive workload that just fits in 50 ms might fit off-board over a wired LAN (0.5 ms transit + 40 ms inference + 0.5 ms back = 41 ms) but blow the budget over Wi-Fi 6E under load (8 ms × 2 + 40 ms = 56 ms, plus jitter). This is why wired tethers matter during development: they let you see whether a workload fundamentally fits off-board before you fight the wireless network.
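The two worked sums above reduce to one small check. The function names are made up for this sketch, and the optional jitter allowance is an assumption you would set from your own link measurements.

```python
def round_trip_ms(one_way_ms: float, inference_ms: float) -> float:
    """Transit out, inference, transit back."""
    return 2 * one_way_ms + inference_ms


def fits_budget(one_way_ms: float, inference_ms: float,
                budget_ms: float, jitter_ms: float = 0.0) -> bool:
    """True if the round trip plus a jitter allowance stays within budget."""
    return round_trip_ms(one_way_ms, inference_ms) + jitter_ms <= budget_ms


if __name__ == "__main__":
    budget = 50.0  # upper edge of the reflexive tier
    for label, one_way in (("wired LAN", 0.5), ("Wi-Fi 6E under load", 8.0)):
        total = round_trip_ms(one_way, 40.0)
        verdict = "fits" if fits_budget(one_way, 40.0, budget) else "over budget"
        print(f"{label}: {total:.0f} ms, {verdict}")
```

Running the same check with a non-zero `jitter_ms` is the quickest way to see how little slack the wireless case has.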
Network reality
Off-board only works if the network does. The actual numbers:
| Link | Median RTT | P99 RTT | Jitter under load |
|---|---|---|---|
| Wired 2.5 / 10 GbE tether | 0.2–0.5 ms | 0.5–1 ms | Sub-ms |
| Wi-Fi 6E, 6 GHz, line-of-sight, dedicated AP | 3–10 ms | 15–30 ms | Manageable |
| Wi-Fi 6 / 6E, shared 5 GHz, contested | 8–25 ms | 80–200 ms | Bad |
| Wi-Fi 6E under load (file transfer in same SSID) | 10–40 ms | 200 ms – 2 s | Robot-breaking |
| Cellular 5G mid-band | 20–40 ms | 100–300 ms | Variable |
| WAN to EU cloud | 15–40 ms | 80–300 ms | BGP-dependent |
For deliberative workloads (100 ms – 1 s budget), Wi-Fi 6E on a dedicated SSID and a line-of-sight AP is fine. For reflexive workloads (10–50 ms), wired is the safe answer and Wi-Fi 6E is the lucky answer. Plan accordingly: if a robot's task has any reflexive off-board component, design with a wired option for development and a backup mode for when Wi-Fi degrades.
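A pessimistic way to apply the table is to test each link at the top of its P99 range rather than at the median. The sketch below does that; the link labels are shortened here, and the inference times passed in are placeholders rather than measurements.

```python
# Worst-case P99 RTT in ms: the upper end of each range in the table above.
P99_RTT_MS = {
    "wired tether": 1.0,
    "wi-fi 6e dedicated": 30.0,
    "wi-fi shared 5 ghz": 200.0,
    "wi-fi 6e under load": 2000.0,
    "5g mid-band": 300.0,
    "wan to eu cloud": 300.0,
}


def links_within(budget_ms: float, inference_ms: float) -> list[str]:
    """Links whose worst-case P99 round trip plus inference fits the budget."""
    return [
        name for name, rtt_ms in P99_RTT_MS.items()
        if rtt_ms + inference_ms <= budget_ms
    ]


if __name__ == "__main__":
    # Reflexive tier: 50 ms budget with a 40 ms model (illustrative figure).
    print("reflexive:", links_within(50.0, 40.0))
    # Deliberative tier: 1 s budget with a 300 ms model (illustrative figure).
    print("deliberative:", links_within(1000.0, 300.0))
```

Sizing against P99 is deliberately conservative; a link that passes only at the median is the "lucky answer" described above.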
The hybrid pattern — fast/slow split
The pattern that most serious 2026 deployments converge on is fast/slow split VLM: a small VLM on-board for instant feedback, a large VLM off-board for considered decisions, both feeding the same planner.
Concrete shape:
- Camera frames
  - 33 ms cadence per frame
  - Feeds both on-board and off-board VLM paths
- On-board fast path: Qwen2.5-VL-7B on Jetson AGX Orin / Thor
  - 5–10 Hz output
  - Scene summary + immediate action
  - Feeds reflexive layer directly
- Off-board slow path, reached over gRPC: Qwen2.5-VL-72B on K-AI 96 / 256
  - 1–3 Hz output
  - Considered scene reasoning + plan adjustment
  - Feeds deliberative layer, can override on-board

Figure: fast/slow split. The on-board 7B handles immediate response; the off-board 72B handles considered reasoning and plan updates.
The fast model handles "there is a person walking toward me, slow down". The slow model handles "the person walking toward me is the operator, who asked me to pause if they approach with the red clipboard, and that is a red clipboard, so pause". The fast model commits to safe behaviour first; the slow model upgrades it.
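One way to picture the "commit first, upgrade later" behaviour is a small arbiter in front of the planner. This is a sketch, not a description of any shipping stack: the `Decision` type, the `arbitrate` function, and the one-second staleness limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    action: str
    source: str         # "fast" (on-board) or "slow" (off-board)
    timestamp_s: float  # when the model produced this decision


def arbitrate(fast: Decision, slow: Optional[Decision], now_s: float,
              max_slow_age_s: float = 1.0) -> Decision:
    """Use a fresh slow-path decision if one exists, else the fast-path one.

    The fast path always has an answer, so the robot keeps behaving safely
    when the off-board result is missing or stale (for example, LAN down).
    """
    if slow is not None and (now_s - slow.timestamp_s) <= max_slow_age_s:
        return slow
    return fast


if __name__ == "__main__":
    fast = Decision("slow down", "fast", timestamp_s=10.0)
    slow = Decision("pause", "slow", timestamp_s=9.6)
    print(arbitrate(fast, slow, now_s=10.0).action)   # fresh slow result wins
    print(arbitrate(fast, slow, now_s=12.0).action)   # stale, fall back to fast
    print(arbitrate(fast, None, now_s=10.0).action)   # no slow result yet
```

The staleness limit is the knob that matters: set it from the slow path's output rate, so a 1–3 Hz model is not overriding the fast path with a view of the scene that is seconds old.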
The migration story — what moves on-board in 2026–2027
Three forces compress the off-board side. Thor lands at volume: 2070 FP4 TFLOPS, 128 GB unified memory, and a 130 W envelope are a structural jump from Orin. VLM efficiency keeps improving: a "good enough" perception VLM was 70B+ in 2024, is 32B in 2026, and will plausibly be 13B–20B in 2027. And speculative decoding with small drafter models brings big-model output quality closer to small-model latency.
Workloads most likely to migrate on-board over the next 18 months: 7B-class scene VLM (already moving), 13B LLM planner (Thor makes it comfortable), short-context STT-LLM-TTS for voice, domain-specific fine-tuned 7B VLMs.
Workloads that do not migrate: anything 70B+, fleet-scale scene memory, training and simulation, multi-stream VLM fusion. These remain off-board through 2027 minimum. Memory bandwidth and battery power do not bend that fast.
Two concrete configurations
Single G1 EDU, research lab. On-board AGX Orin runs ROS 2, YOLO11s at 30 FPS, Distil-Whisper STT, and Qwen2.5-VL 7B INT4 at 5–8 FPS (deliberative). Off-board K-AI 96 (4× RTX 5090, 128 GB VRAM) runs vLLM with Qwen2.5-VL 72B and Qwen2.5 32B text-only, pgvector scene memory, and Isaac Sim on idle GPUs.
Small fleet, 3 humanoids, Thor-class on-board. Each unit runs a Hailo-15 vision SoC for always-on multi-cam detect, on-device Whisper-small, and Qwen2.5-VL 32B INT4 at 10–15 FPS — natively, no longer marginal. Off-board K-AI 256 (8× RTX 5090, 256 GB VRAM) hosts a shared Qwen2.5-VL 72B and Llama-3.1 70B for planning, a shared pgvector store across all three robots, and two GPUs reserved for overnight LoRA fine-tuning.
Where pure single-tier deployments do work
Pure on-board works for narrow-task quadrupeds (perimeter patrol with YOLO + thermal cam), teleoperated robots (operator is the planner, no VLM in loop), demo / education robots with the smallest models, and any deployment with no network at all (outdoor inspection, remote sites).
Pure cloud works for voice-only robotic assistants without a closed perception loop, prototypes where setup speed beats everything, and deployments where the data is intentionally going to a cloud workload anyway.
Hybrid is the answer for everything in between — which is roughly 90% of serious robotics in 2026. If you are building anything with VLM-in-the-loop perception, you will end up with both tiers. Plan for it from day one.
Decision flow
When you are sizing a deployment:
- List the workloads. Motor control, perception, STT, LLM, VLM, planning, memory, training. Be explicit.
- Tag each with a latency tier and a memory class (≤ 8 GB / 8–40 GB / 40–80 GB / 80+ GB).
- Apply the matrix above to get a first-pass placement.
- Audit on-board against the robot's actual SoC and thermal budget. A G1 with Orin NX is a different machine from a T1 with AGX Orin or a future Thor platform.
- Audit off-board against the K-AI tier you can afford. K-AI 96 is the floor for one robot doing serious work; K-AI 256 is the floor for a fleet or any training requirement.
- Audit the network. Wired tether for development, Wi-Fi 6E dedicated SSID for production. If your reflexive layer crosses the network, plan for wired or a graceful-degrade fallback.
- Plan migration. When Thor ships in your platform, which workloads move on-board? Abstract the gRPC boundary now so the migration is a deployment change, not a rewrite.
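The last step, keeping the on-board/off-board boundary abstract, can be sketched with a small interface. Everything here is hypothetical: the `SceneModel` protocol, both backend classes, and the endpoint string are invented for illustration, and the remote backend only marks where a real gRPC stub call would go.

```python
from typing import Protocol


class SceneModel(Protocol):
    """What the planner depends on, regardless of where the model runs."""

    def describe(self, frame: bytes) -> str: ...


class OnBoardModel:
    """Stand-in for a model running on the robot's own SoC."""

    def describe(self, frame: bytes) -> str:
        return f"on-board summary of {len(frame)} bytes"


class RemoteModel:
    """Stand-in for a model behind a network boundary."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint  # hypothetical address of the off-board server

    def describe(self, frame: bytes) -> str:
        # A real implementation would call a generated gRPC stub here.
        return f"remote summary from {self.endpoint}"


def make_model(placement: str, endpoint: str = "") -> SceneModel:
    """Pick a backend from deployment config rather than from code."""
    if placement == "on-board":
        return OnBoardModel()
    if placement == "off-board":
        return RemoteModel(endpoint)
    raise ValueError(f"unknown placement: {placement!r}")


if __name__ == "__main__":
    model = make_model("off-board", endpoint="k-ai.example:50051")
    print(model.describe(b"\x00" * 1024))
```

Because the planner only sees `SceneModel`, moving a workload on-board later means changing the placement value in config, which is the "deployment change, not a rewrite" outcome described above.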
The follow-up articles (R05, I02, I05) go deeper on the pieces sketched here. The placement decisions in this article are the scaffolding everything else hangs on.
The honest take: 90% of serious robotics deployments in 2026 need both tiers. On-board is sized for safety, reflex, and small-model perception. Off-board is sized for VLM-in-the-loop, scene memory, planning, and training. Single-tier deployments work for narrow use cases — and only those.
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.