On-Device vs Off-Device Compute for Robots

11 June 2026

Every workload on a robot lives in one of two places: on-board, riding the battery and the chest fans, or off-board, on a wall-powered server one LAN hop away. The hard question is not whether you need both (you almost always do) but which workload goes where, and why. R08 makes the case for having a dedicated edge tier at all. This article makes the per-workload decision: motor control here, VLM there, STT depends, and so on.

The framing is opinionated. There is a right answer per workload, and the right answer changes as on-board silicon catches up. We will say which way each workload is moving in 2026–2027.

The two budgets that decide everything

A workload can run on-board if and only if it fits in two budgets simultaneously: power and memory. Latency is the third constraint, but latency usually tells you which way to push a workload that fits, not whether it fits at all.

Power budget on a humanoid. A 30–50 kg humanoid carries 700–900 Wh of battery. Sustained walking draws 200–500 W from the actuators, with peaks of 800 W+ during dynamic motion. Standing still and thinking draws 80–150 W from the actuators (holding torque is cheap; holding posture costs). That leaves roughly 60–120 W for everything else (compute, sensors, cooling fans, radio) if you want a usable two-hour runtime. The compute budget inside that envelope is 30–80 W sustained. Running the Jetson at MAXN (60 W) all the time consumes half or more of that 60–120 W envelope on its own. This is non-negotiable physics, not a design choice.
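The arithmetic behind that envelope can be made explicit. What follows is a minimal sketch, assuming a constant average draw and a fully usable battery (both simplifications). The function name and the example figures are illustrative, chosen from within the ranges quoted above rather than measured on any robot.

```python
def runtime_hours(battery_wh: float, actuators_w: float,
                  compute_w: float, other_w: float) -> float:
    """Estimated runtime at a constant average draw (ignores battery derating)."""
    total_w = actuators_w + compute_w + other_w
    if total_w <= 0:
        raise ValueError("total draw must be positive")
    return battery_wh / total_w


# Illustrative values from within the ranges above (assumptions, not measurements).
BATTERY_WH = 800.0  # middle of the 700-900 Wh range
WALKING_W = 300.0   # sustained walking, actuators only
OTHER_W = 40.0      # sensors, cooling fans, radio

for compute_w in (15.0, 30.0, 60.0):
    hours = runtime_hours(BATTERY_WH, WALKING_W, compute_w, OTHER_W)
    print(f"compute {compute_w:>4.0f} W -> {hours:.2f} h")
```

Swapping in a robot's measured actuator draw shows quickly how much runtime each extra watt of compute costs.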

Memory budget. A Jetson AGX Orin 64 GB is 64 GB of LPDDR5 shared between the CPU and the GPU. Subtract the OS (4–6 GB), ROS 2 and the manufacturer SDK (2–4 GB), perception buffers (2–4 GB for stereo depth and point clouds), and any application code. The realistic LLM/VLM-usable budget is 40–48 GB. On a Jetson Thor 128 GB module the ceiling rises to ~96 GB usable. On Orin NX 16 GB it collapses to ~8–10 GB usable — enough for one small model.
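The same subtraction can be written down as a quick check. This is a rough sketch under stated assumptions: the class and function names are hypothetical, the 8 GB reserved for application code and headroom is an assumed figure, and the 4.5 bits-per-weight default is a typical value for 4-bit quantization formats rather than a number from this article. Fitting in memory is necessary, not sufficient; power and frame rate still have to work out.

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    total_gb: float
    os_gb: float
    middleware_gb: float     # ROS 2 plus manufacturer SDK
    perception_gb: float     # stereo depth, point clouds
    app_headroom_gb: float   # application code plus margin (assumed figure)

    def usable_gb(self) -> float:
        reserved = (self.os_gb + self.middleware_gb
                    + self.perception_gb + self.app_headroom_gb)
        return self.total_gb - reserved


def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate quantized weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(budget: MemoryBudget, model_weights_gb: float, kv_cache_gb: float) -> bool:
    """True if weights plus KV cache fit in the usable budget."""
    return model_weights_gb + kv_cache_gb <= budget.usable_gb()


# Upper end of each overhead range from the text; 8 GB app/headroom is assumed.
agx_orin = MemoryBudget(total_gb=64, os_gb=6, middleware_gb=4,
                        perception_gb=4, app_headroom_gb=8)

print("usable:", agx_orin.usable_gb(), "GB")
print("7B Q4 fits:", fits(agx_orin, weights_gb(7), kv_cache_gb=2.0))
# 70B-class figures as quoted in the decision matrix: 45 GB weights + 10 GB KV.
print("70B Q4 fits:", fits(agx_orin, 45.0, kv_cache_gb=10.0))
```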

Off-board, both budgets are different problems. A K-AI 96 (4× RTX 5090) has 128 GB of dedicated VRAM and 1.0–1.5 kW of GPU power under wall power. A K-AI 256 (8× RTX 5090) doubles both. Power and memory simply stop being constraints — you trade them for capex, room HVAC, and wired networking instead.

On-board silicon, 2026 reality

The on-board options that actually ship in serious humanoids and quadrupeds today:

SoC	Peak compute (INT8 TOPS, sparse, unless noted)	Usable memory	Power envelope	What it can run
Jetson Orin NX 16 GB	100	~10 GB shared	10–25 W	YOLO, small VLM 3B Q4, wake-word
Jetson AGX Orin 64 GB	275	~40 GB shared	15–60 W	YOLO, VLM 7B Q4 at usable rates, 13B LLM
Jetson AGX Thor 128 GB (2026)	2070 FP4 TFLOPS	~96 GB shared	40–130 W	VLM 32B Q4 with room to spare, dual-stream perception
Snapdragon 8 Gen 3 / QRB-class	45–75	~6–10 GB shared	5–15 W	Voice, wake-word, light CV
Hailo-8 (M.2 add-in)	26 INT8 TOPS	2–4 GB on-chip	2.5 W typical	Vision pipeline offload, MobileNet/YOLO
Hailo-15 (vision SoC)	20 TOPS	Pipeline-resident	2–5 W	Always-on multi-camera CV
Intel N97 / i7-1370P co-processor	n/a (CPU + iGPU)	DDR5 host	6–45 W	High-level orchestration, ROS 2, glue

Thor is a step change, not an iteration. The 128 GB memory ceiling and the FP4-native Blackwell tensor cores move the on-board frontier from "7B VLM, painfully" to "32B VLM, usefully" in one generation. Workloads that were structurally off-board in 2025 — moderate-sized VLMs, continuous scene captioning, mid-size LLM planners — become on-board candidates in 2026 if the robot's thermal solution can sustain 100–130 W.

Hailo and the dedicated NPU class are not LLM accelerators. They are vision-pipeline offload: keep the main Jetson free for the bigger model, run the always-on camera CV (object detect, tracking, basic VLM-lite) on a 3 W chip.

Snapdragon QRB-class is the answer for voice-first robots — wake-word, beamforming, STT — where a Jetson is overkill and a Cortex-M is undersized. Useful, narrow, increasingly common.

Off-board: the K-AI tier as reference

The off-board target referenced throughout this series is the K-AI line — wall-powered, 4U or 5U rack, 4 or 8 GPUs on EPYC or Xeon, on a 10 GbE switch one hop from the robot. The three tiers that matter for robotics:

Tier	GPUs	Aggregate VRAM	Sustained power	Largest realistic VLM (INT4)
K-AI 64 / 96 (4-GPU)	4× RTX 5090 or 4× Pro 6000	128–384 GB	1.8–2.4 kW	72B with room for KV cache
K-AI 128 (4× Pro 6000 Blackwell)	4× RTX Pro 6000 96 GB	384 GB	2.0–2.6 kW	72B with full context
K-AI 256 (8-GPU)	8× RTX 5090 or 8× Pro 6000	256–768 GB	3.5–4.5 kW	70B + 32B + 7B simultaneously

The decision matrix

Every robotics workload sorts cleanly into one of five categories: always on-board, on-board preferred, either, off-board preferred, off-board only. The reason for each placement is the binding budget — power, memory, or latency.

Workload	Placement	Binding constraint	Why
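Motor control loop (500 Hz–1 kHz)	Always on-board	Latency (< 1 ms)	Even a wired round trip (0.2–0.5 ms) eats a large share of the loop budget; Wi-Fi round trips run from several times to thousands of times over it. No exceptions.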
Joint safety reflexes / e-stop	Always on-board	Latency (< 5 ms)	Same reason; also must work when LAN is down.
IMU sensor fusion	Always on-board	Latency (< 5 ms)	Feeds the control loop directly.
Stereo / RGB-D depth	Always on-board	Bandwidth + latency	Raw camera bandwidth is too high to ship; depth output is small but needed fast.
YOLO-class object detection	On-board preferred	Latency (10–30 ms reflexive)	Fits a Jetson at 30+ FPS. No reason to network it.
Wake-word detection	On-board preferred	Power + latency	Always-on, sub-watt on QRB or Hailo. Off-board burns network bandwidth 24/7.
Whisper-class STT (small)	Either	Latency tolerance	100–250 ms on Jetson AGX Orin. Off-board is 30–80 ms on a 5090. User can't tell.
Small LLM (≤ 8B Q4) for dialogue	Either	Latency vs concurrency	15–25 tok/s on AGX Orin, 80–150 tok/s on a 5090. Off-board wins if you need scale.
VLM 3B Q4 (scene captioning)	On-board preferred	Power + latency	Fits Orin easily, no network needed, ~10 FPS usable.
VLM 7B Q4 (Qwen2.5-VL, OpenVLA)	Either	Power vs frame rate	AGX Orin runs 5–8 FPS at 30 W. K-AI runs 30+ FPS. Pick per-task.
VLM 32B Q4	Off-board preferred (Thor: marginal on-board)	Memory + power	Doesn't fit AGX Orin. Marginal on Thor 128 GB unless the thermal solution sustains 100–130 W. Comfortable on K-AI 96.
VLM 70B+ Q4	Off-board only	Memory	45–50 GB weights + 10–20 GB KV. No on-board module hosts this in 2026.
Motion planner (short horizon, < 1 s)	On-board preferred	Latency	Tight loop with control. Local is the right answer.
Motion planner (long horizon, multi-s)	Off-board preferred	Memory + model size	VLM or diffusion-based, large context, off-board wins.
Scene memory / RAG	Off-board only	Persistence + memory	Must survive robot reboot; vector store wants real storage and CPU RAM.
Multi-camera VLM fusion (3–5 streams)	Off-board only	Memory + compute	Batching across cameras needs a multi-GPU server.
Fine-tuning / LoRA training	Off-board only	Memory + power	2–5× inference memory, sustained kW-class power. Not happening on a battery.
Full pre-training	Off-board only	Doesn't even fit one server	Multi-node territory. See the K track.
Isaac Sim policy iteration	Off-board only	GPU rendering + RL throughput	Inherently a server workload.

Three readings of this table are worth pulling out.

The "always on-board" rows are physics. No quantity of bandwidth fixes a control loop that needs 1 kHz response. These never move, regardless of how good the network or off-board compute gets.

The "either" rows are where actual engineering judgement happens. Most of the interesting trade-offs live here. Whether a 7B VLM runs on-board or off-board determines a real chunk of the system's behaviour. There is no global right answer.

The "off-board only" rows move slowly. A 72B model will probably still be off-board only in 2027. A 32B model will move on-board as Thor ships in volume. The frontier of "what fits on-device" advances roughly one model size class every 18–24 months.

Latency tier mapping

The four-tier latency budget from I01 maps to placement directly:

Tier	Budget	Placement
Reactive control	< 10 ms	On-board only. Period.
Reflexive perception	10–50 ms	On-board by default (a loaded Wi-Fi round trip alone can eat the budget; wired LAN can fit)
Deliberative planning	100 ms – 1 s	Either. LAN works, on-board works.
Strategic reasoning	1 s – multi-s	Either, off-board often wins on model quality.

The two middle tiers are where the real architecture decisions sit. A reflexive workload that just fits in 50 ms might fit off-board over a wired LAN (0.5 ms transit + 40 ms inference + 0.5 ms back = 41 ms) but blow the budget over Wi-Fi 6E under load (8 ms × 2 + 40 ms = 56 ms, plus jitter). This is why wired tethers matter during development: they let you see whether a workload fundamentally fits off-board before you fight the wireless network.
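The two sums in that paragraph generalize to a one-line check. Here is a small sketch, with hypothetical function names and an optional jitter margin that is an added assumption; the transit and inference figures are the ones used above.

```python
def end_to_end_ms(one_way_ms: float, inference_ms: float,
                  jitter_margin_ms: float = 0.0) -> float:
    """Request transit + inference + response transit, plus an optional margin."""
    return 2 * one_way_ms + inference_ms + jitter_margin_ms


def fits_budget(one_way_ms: float, inference_ms: float, budget_ms: float,
                jitter_margin_ms: float = 0.0) -> bool:
    return end_to_end_ms(one_way_ms, inference_ms, jitter_margin_ms) <= budget_ms


REFLEXIVE_BUDGET_MS = 50.0
INFERENCE_MS = 40.0

links = {
    "wired LAN": 0.5,              # one-way transit in ms, from the text
    "Wi-Fi 6E under load": 8.0,
}

for name, one_way in links.items():
    total = end_to_end_ms(one_way, INFERENCE_MS)
    verdict = "fits" if fits_budget(one_way, INFERENCE_MS, REFLEXIVE_BUDGET_MS) else "blows the budget"
    print(f"{name}: {total:.1f} ms, {verdict}")
```

Running the check against a wired tether first separates "the model is too slow" from "the link is too slow", which is the point of tethering during development.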

Network reality

Off-board only works if the network does. The actual numbers:

Link	Median RTT	P99 RTT	Jitter under load
Wired 2.5 / 10 GbE tether	0.2–0.5 ms	0.5–1 ms	Sub-ms
Wi-Fi 6E, 6 GHz, line-of-sight, dedicated AP	3–10 ms	15–30 ms	Manageable
Wi-Fi 6 / 6E, shared 5 GHz, contested	8–25 ms	80–200 ms	Bad
Wi-Fi 6E under load (file transfer in same SSID)	10–40 ms	200 ms – 2 s	Robot-breaking
Cellular 5G mid-band	20–40 ms	100–300 ms	Variable
WAN to EU cloud	15–40 ms	80–300 ms	BGP-dependent

For deliberative workloads (100 ms – 1 s budget), Wi-Fi 6E on a dedicated SSID and a line-of-sight AP is fine. For reflexive workloads (10–50 ms), wired is the safe answer and Wi-Fi 6E is the lucky answer. Plan accordingly: if a robot's task has any reflexive off-board component, design with a wired option for development and a backup mode for when Wi-Fi degrades.
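One way to build that backup mode is to watch the link's tail latency and fall back to on-board when it degrades. The sketch below is illustrative only: `LinkMonitor`, the window size, and the 10 ms network budget are hypothetical choices, and the RTT samples would come from whatever probe the deployment already runs.

```python
from collections import deque


class LinkMonitor:
    """Tracks recent RTT samples and reports whether off-board is safe to use."""

    def __init__(self, budget_ms: float, window: int = 200):
        self.budget_ms = budget_ms
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, rtt_ms: float) -> None:
        self.samples.append(rtt_ms)

    def p99_ms(self) -> float:
        """Nearest-rank 99th percentile; infinite until at least one sample exists."""
        n = len(self.samples)
        if n == 0:
            return float("inf")
        ordered = sorted(self.samples)
        rank = (99 * n + 99) // 100  # integer ceil(0.99 * n)
        return ordered[rank - 1]

    def use_offboard(self) -> bool:
        # With no data the P99 is infinite, so the default is the on-board path.
        return self.p99_ms() <= self.budget_ms


monitor = LinkMonitor(budget_ms=10.0)  # assumed network share of a reflexive budget
for _ in range(100):
    monitor.record(0.4)                # healthy wired-class RTTs
print("off-board OK:", monitor.use_offboard())

for _ in range(5):
    monitor.record(150.0)              # a burst of contested Wi-Fi RTTs
print("off-board OK:", monitor.use_offboard())
```

Keying the decision on P99 rather than the median matches the table above: the medians look fine on most links, and the tail is what breaks the robot.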

The hybrid pattern — fast/slow split

The pattern that most serious 2026 deployments converge on is a fast/slow VLM split: a small VLM on-board for instant feedback, a large VLM off-board for considered decisions, both feeding the same planner.

Concrete shape:

Camera frame, 30 FPS: 33 ms cadence per frame; feeds both the on-board and off-board VLM paths.

On-board VLM 7B Q4: Qwen2.5-VL-7B on Jetson AGX Orin / Thor; 5–10 Hz output; scene summary plus immediate action; feeds the reflexive layer directly.

Off-board VLM 72B Q4 (exchanges data with the on-board side over gRPC): Qwen2.5-VL-72B on K-AI 96 / 256; 1–3 Hz output; considered scene reasoning plus plan adjustment; feeds the deliberative layer and can override the on-board model.

Figure: fast/slow split. The on-board 7B handles immediate response; the off-board 72B handles considered reasoning and plan updates.

The fast model handles "there is a person walking toward me, slow down". The slow model handles "the person walking toward me is the operator, who asked me to pause if they approach with the red clipboard, and that is a red clipboard, so pause". The fast model commits to safe behaviour first; the slow model upgrades it.
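The "commit first, upgrade later" behaviour can be sketched as a small arbiter in front of the planner. Everything here is hypothetical: the `Decision` and `Arbiter` names, the action strings, and the one-second staleness limit on the slow model are assumptions for illustration, not part of any particular stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    action: str
    source: str     # "fast" (on-board) or "slow" (off-board)
    stamp_s: float  # capture time of the frame the decision is based on


class Arbiter:
    """Prefers a fresh slow-model decision, otherwise the latest fast one."""

    def __init__(self, slow_max_age_s: float = 1.0):
        self.slow_max_age_s = slow_max_age_s
        self._fast: Optional[Decision] = None
        self._slow: Optional[Decision] = None

    def update(self, decision: Decision) -> None:
        if decision.source == "slow":
            self._slow = decision
        else:
            self._fast = decision

    def current(self, now_s: float) -> Optional[Decision]:
        slow = self._slow
        if slow is not None and now_s - slow.stamp_s <= self.slow_max_age_s:
            return slow       # considered decision is still fresh: it overrides
        return self._fast     # otherwise stay on the on-board decision


arbiter = Arbiter()
arbiter.update(Decision("slow_down", "fast", stamp_s=0.0))
print(arbiter.current(now_s=0.1))   # fast model has committed to safe behaviour

arbiter.update(Decision("pause", "slow", stamp_s=0.2))
print(arbiter.current(now_s=0.5))   # slow model upgrades the decision

arbiter.update(Decision("slow_down", "fast", stamp_s=2.0))
print(arbiter.current(now_s=2.0))   # slow decision went stale, fast one applies
```

The staleness limit also gives a natural degrade path: if the off-board link drops, the slow decision simply ages out and the robot continues on the on-board model.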

The migration story — what moves on-board in 2026–2027

Three forces compress the off-board side: Thor lands at volume (2070 FP4 TFLOPS, 128 GB unified memory, 130 W envelope — a structural jump from Orin); VLM efficiency keeps improving (a "good enough" perception VLM was 70B+ in 2024, is 32B in 2026, will plausibly be 13B–20B in 2027); and speculative decoding with small drafter models lets you serve big-model quality at small-model latency.

Workloads most likely to migrate on-board over the next 18 months: 7B-class scene VLM (already moving), 13B LLM planner (Thor makes it comfortable), short-context STT-LLM-TTS for voice, domain-specific fine-tuned 7B VLMs.

Workloads that do not migrate: anything 70B+, fleet-scale scene memory, training and simulation, multi-stream VLM fusion. These remain off-board through 2027 minimum. Memory bandwidth and battery power do not bend that fast.

Two concrete configurations

Single G1 EDU, research lab. On-board AGX Orin runs ROS 2, YOLOv11-s at 30 FPS, Whisper-distil STT, and Qwen2.5-VL 7B INT4 at 5–8 FPS (deliberative). Off-board K-AI 96 (4× RTX 5090, 128 GB VRAM) runs vLLM with Qwen2.5-VL 72B and Qwen2.5 32B text-only, pgvector scene memory, and Isaac Sim on idle GPUs.

Small fleet, 3 humanoids, Thor-class on-board. Each unit runs a Hailo-15 vision SoC for always-on multi-cam detect, on-device Whisper-small, and Qwen2.5-VL 32B INT4 at 10–15 FPS — natively, no longer marginal. Off-board K-AI 256 (8× RTX 5090, 256 GB VRAM) hosts a shared Qwen2.5-VL 72B and Llama-3.1 70B for planning, a shared pgvector store across all three robots, and two GPUs reserved for overnight LoRA fine-tuning.

Where pure single-tier deployments do work

Pure on-board works for narrow-task quadrupeds (perimeter patrol with YOLO + thermal cam), teleoperated robots (operator is the planner, no VLM in loop), demo / education robots with the smallest models, and any deployment with no network at all (outdoor inspection, remote sites).

Pure cloud works for voice-only robotic assistants without a closed perception loop, prototypes where setup speed beats everything, and deployments where the data is intentionally going to a cloud workload anyway.

Hybrid is the answer for everything in between — which is roughly 90% of serious robotics in 2026. If you are building anything with VLM-in-the-loop perception, you will end up with both tiers. Plan for it from day one.

Decision flow

When you are sizing a deployment:

List the workloads. Motor control, perception, STT, LLM, VLM, planning, memory, training. Be explicit.
Tag each with a latency tier and a memory class (≤ 8 GB / 8–40 GB / 40–80 GB / 80+ GB).
Apply the matrix above to get a first-pass placement.
Audit on-board against the robot's actual SoC and thermal budget. A G1 with Orin NX is a different machine from a T1 with AGX Orin or a future Thor platform.
Audit off-board against the K-AI tier you can afford. K-AI 96 is the floor for one robot doing serious work; K-AI 256 is the floor for a fleet or any training requirement.
Audit the network. Wired tether for development, Wi-Fi 6E dedicated SSID for production. If your reflexive layer crosses the network, plan for wired or a graceful-degrade fallback.
Plan migration. When Thor ships in your platform, which workloads move on-board? Abstract the gRPC boundary now so the migration is a deployment change, not a rewrite.
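The tagging and first-pass placement steps can be sketched in a few lines. This is a deliberate simplification of the matrix, not a replacement for it: the `Tier` enum and `first_pass_placement` function are hypothetical, the rule looks only at latency tier and memory, and the memory figures for motor control and YOLO in the example are assumed placeholders.

```python
from enum import Enum


class Tier(Enum):
    REACTIVE = "reactive"          # < 10 ms
    REFLEXIVE = "reflexive"        # 10-50 ms
    DELIBERATIVE = "deliberative"  # 100 ms - 1 s
    STRATEGIC = "strategic"        # 1 s and up


def first_pass_placement(tier: Tier, memory_gb: float,
                         onboard_usable_gb: float) -> str:
    """Latency decides first; memory decides among what latency allows."""
    fits_onboard = memory_gb <= onboard_usable_gb
    if tier in (Tier.REACTIVE, Tier.REFLEXIVE):
        # Fast tiers belong on-board; if the model is too big, rethink the design.
        return "on-board" if fits_onboard else "rethink"
    return "either" if fits_onboard else "off-board"


ONBOARD_USABLE_GB = 40.0  # lower end of the AGX Orin figure quoted earlier

workloads = [
    ("motor control", Tier.REACTIVE, 0.1),   # memory figure assumed
    ("YOLO detection", Tier.REFLEXIVE, 1.0), # memory figure assumed
    ("VLM 7B Q4", Tier.DELIBERATIVE, 6.0),
    ("VLM 70B Q4", Tier.STRATEGIC, 55.0),
]

for name, tier, mem in workloads:
    placement = first_pass_placement(tier, mem, ONBOARD_USABLE_GB)
    print(f"{name}: {placement}")
```

Re-running the same list with a Thor-class `ONBOARD_USABLE_GB` is a cheap way to preview which workloads become migration candidates.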

The follow-up articles (R05, I02, I05) go deeper on the pieces sketched here. The placement decisions in this article are the scaffolding everything else hangs on.

The honest take: 90% of serious robotics deployments in 2026 need both tiers. On-board is sized for safety, reflex, and small-model perception. Off-board is sized for VLM-in-the-loop, scene memory, planning, and training. Single-tier deployments work for narrow use cases — and only those.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
