Why Robots Need Dedicated Edge Compute

28 May 2026

A humanoid that walks is a solved problem. A humanoid that understands what it is looking at, plans across a multi-step task, and remembers what happened five minutes ago is not. The compute that closes that gap does not fit on the robot, and the cloud is the wrong place to put it. This article is the technical case for why a dedicated on-prem inference server is the bleeding-edge answer in 2026 for any robotics deployment serious enough to do useful work.

I01 covers how the robot and the server are wired together. This article explains why you would buy the server at all. If you are still asking whether on-prem makes sense, read this first.

The decision-latency budget

Every action a robot takes sits in one of four latency tiers. Each tier has a budget the physics of the task imposes — miss it and the behaviour breaks in a specific, recognisable way.

Latency tiers — decision budget by control layer

Tier	Budget	Examples	What breaks if you miss it
Reactive control	< 10 ms	Joint torque, balance, motor commutation, e-stop	Robot falls over, oscillates, damages itself
Reflexive perception	10–50 ms	Obstacle detect, contact response, fast tracking	Collisions, missed grasps, dropped objects
Deliberative planning	100 ms – 1 s	"Pick the red cup," scene understanding, dialogue ack	Awkward pauses, conversational latency, jerky task transitions
Strategic reasoning	1 s – multi-s	Multi-step task plans, error recovery, long dialogue	Acceptable; user perceives "thinking"

These are not arbitrary. They come from the bandwidth of the closed loop they sit inside.

Reactive control runs at 500 Hz to 1 kHz on the joint level because the dynamics of a 30 kg humanoid leg demand it — at 100 Hz the limb resonates and the gait diverges. Reflexive perception inherits camera frame rates (30–60 FPS = 16–33 ms per frame) and the timescale of physical contact (a finger pressing on an object closes a contact event in 20–40 ms). Deliberative planning is bounded by human conversational expectations (one second feels responsive, two feels slow) and by the model latency of a useful VLM. Strategic reasoning is the only tier with real slack, which is why everyone wants to push planning into it and gets bitten when their VLM cannot keep up.
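
The tier boundaries can be written down as a small lookup, which is handy when tagging measured latencies in logs. What follows is a minimal sketch, assuming the 10 ms, 50 ms and 1 s bounds from the table. The table leaves 50–100 ms unassigned, and the sketch counts that gap as deliberative, which is an assumption. The names `tier_for_latency` and `period_ms` are illustrative.

```python
from bisect import bisect_right

# Upper bounds in milliseconds, taken from the tier table.
# Assumption: the 50-100 ms gap the table leaves open counts as deliberative.
TIER_BOUNDS_MS = [10.0, 50.0, 1000.0]
TIER_NAMES = [
    "reactive control",
    "reflexive perception",
    "deliberative planning",
    "strategic reasoning",
]


def tier_for_latency(latency_ms: float) -> str:
    """Return the fastest tier whose budget a given latency still meets."""
    if latency_ms < 0:
        raise ValueError("latency must be non-negative")
    return TIER_NAMES[bisect_right(TIER_BOUNDS_MS, latency_ms)]


def period_ms(rate_hz: float) -> float:
    """Loop period in milliseconds for a loop running at rate_hz."""
    return 1000.0 / rate_hz
```

For example, `tier_for_latency(period_ms(30.0))` places a single 30 FPS camera frame (about 33 ms) in the reflexive tier, while a 300 ms VLM response lands in deliberative planning.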

This is not a Kentino opinion. The four-tier budget is the partition every credible robot control stack — ROS 2, Isaac, Drake, OCS2 — bakes in. What differs is what hardware you put behind each tier.

What the on-board compute can and cannot do

A 2026 humanoid carries an NVIDIA Jetson AGX Orin at the top of the spec sheet — 64 GB unified memory, up to 275 sparse INT8 TOPS (138 dense), 15–60 W configurable power. That is genuinely impressive for an embedded module. It is also nowhere near what a modern VLM-driven robot wants.

Run the math on three model classes you might actually want a humanoid to use:

On-board VRAM ceiling — model classes vs. Jetson AGX Orin 64 GB

Model	Params	Min VRAM (Q4 weights + activations)	On Orin AGX 64 GB?
Qwen2.5-VL 7B (INT4)	7B	~5–7 GB	Yes, ~5–8 FPS
OpenVLA 7B (BF16)	7.5B	~15 GB	Yes at INT4, ~3–6 Hz
NVIDIA Cosmos-Reason 7B	7B	~6–8 GB INT4	Yes, slow
Isaac GR00T N1.7	~3B	~16 GB recommended for inference	Marginal; fine-tune needs 40 GB+
Qwen2.5-VL 32B (INT4)	32B	~22–26 GB	Tight; usable but slow
Qwen2.5-VL 72B (INT4)	72B	~45–50 GB weights + 10–20 GB KV	No. Will OOM under any real context.
Llama-3.1 70B (INT4)	70B	~38–45 GB weights + KV	No on Orin under load

The Orin AGX 64 GB will host a 32B-class VLM at INT4 if you accept slow inference, no real batching, and no concurrent workloads. It will not host the 70B-class VLMs that are state-of-the-art for scene understanding in 2026 — Qwen2.5-VL 72B, the larger Cosmos variants, the proprietary models that vendors do not publish weights for. The combined weights, KV cache for long visual context, and any room for a second model do not fit.
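
The fit check behind the table can be sketched in a few lines. The weight and KV ranges below are the table's own figures. The 10 GB reserved for the OS, camera pipeline and control stack is an assumption, since unified memory is shared with everything else on the module. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelFootprint:
    name: str
    weights_gb: tuple[float, float]  # (low, high) estimate for INT4 weights
    kv_gb: tuple[float, float]       # (low, high) estimate for KV cache

    def total_gb(self) -> tuple[float, float]:
        """Low and high total footprint in GB."""
        return (self.weights_gb[0] + self.kv_gb[0],
                self.weights_gb[1] + self.kv_gb[1])


def fits(model: ModelFootprint, memory_gb: float, reserved_gb: float) -> str:
    """Classify a model against a memory budget: 'yes', 'marginal' or 'no'."""
    usable = memory_gb - reserved_gb
    low, high = model.total_gb()
    if high <= usable:
        return "yes"
    if low <= usable:
        return "marginal"
    return "no"


# Figures from the table; the 32B row gives a single combined range.
QWEN_72B = ModelFootprint("Qwen2.5-VL 72B (INT4)", (45.0, 50.0), (10.0, 20.0))
QWEN_32B = ModelFootprint("Qwen2.5-VL 32B (INT4)", (22.0, 26.0), (0.0, 0.0))

ORIN_GB = 64.0
RESERVED_GB = 10.0  # assumption: memory held by OS, perception and control

if __name__ == "__main__":
    for model in (QWEN_32B, QWEN_72B):
        print(model.name, fits(model, ORIN_GB, RESERVED_GB))
```

Under these assumptions the 72B model does not fit even at the low end of its range. The 32B model fits in memory, so the "tight" label in the table is about throughput and concurrent workloads, not capacity.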

There is a second number that gets glossed over: power. The Orin's 275 TOPS figure assumes MAX_N (60 W) mode. That is a battery-powered platform burning 60 W of compute on top of 200–800 W of actuator load. Sustained MAX_N therefore adds roughly 7–30% to the robot's total draw and shortens runtime accordingly. In practice the Orin spends most of its time in 30 W mode, which cuts the TOPS roughly in half and pushes already-marginal inference into "unusable" territory.

Translation: the on-board Jetson is sized for reactive and reflexive tiers. It is not sized to be a VLM host. Anyone telling you their humanoid "runs Qwen2.5-VL on-board" is either running the 3B or 7B model and calling it good, or running the bigger model at 0.5 FPS and calling it a demo. Both are valid for specific use cases. Neither is general-purpose robot perception.

Why cloud is the wrong answer for closed-loop robotics

Cloud inference is cheap, scales effortlessly, and requires zero capex. For a robot, it has four problems, in rough order of severity.

1. WAN latency floor. A well-tuned cloud call within the EU hits 15–40 ms round-trip on the WAN itself, plus 5–15 ms of TLS / HTTP / load-balancer overhead, plus model inference time, plus the trip back. Transatlantic adds 80–120 ms of round-trip on top. For a reflexive perception query — "is there an obstacle in front of me?" — adding 30–50 ms before the model even starts is a budget breach. For deliberative planning at 200–500 ms you might tolerate it, but every dropped packet, every retransmission, every cell-tower failover spikes you into the next tier up.

2. Jitter. WAN RTT is not a constant. It is a distribution. Median 25 ms, P99 250 ms, P99.9 several seconds. A robot acting in the real world cannot accept a several-second pause because a BGP route flapped. On-prem LAN P99.9 is 1–2 ms.

3. Cost at sustained load. A single 70B VLM inference costs a cloud provider almost nothing; they charge a few cents. A robot that is "always perceiving" makes one VLM call every 100–500 ms while active. That is 7,200–36,000 calls per hour, per robot. A fleet of three robots running eight hours a day at the high end is 864,000 calls per day. At even $0.005 per call on a hosted 72B endpoint, that is about $4,300 per day, or roughly $130k per month. An on-prem 8× GPU server pays back in under three months at that load.

4. Data sovereignty. The robot sees a factory floor, a patient room, a research lab, a warehouse with proprietary inventory layout, a military training ground. That video is privileged or regulated under GDPR, HIPAA, ITAR, or simple competitive sensitivity. Shipping it to a third-party cloud — even one that signs a DPA — is either prohibited or a non-trivial compliance burden. On-prem inference makes the data sovereignty question vanish: the bytes never leave the building.

There is a fifth, softer problem: vendor lock-in. Cloud APIs serve the models the provider wants to serve, with the quantizations and context windows the provider chose. You cannot run a fine-tuned VLM on OpenAI's endpoint. You cannot pin a specific commit of an open-weight model. You cannot mix models from competing vendors in one pipeline. For prototyping these constraints are fine. For a production robotics deployment that has to be predictable for years, they are not.
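
The cost arithmetic in point 3 is easy to rerun for your own fleet. The sketch below uses the call interval, fleet size and per-call price quoted there. The article does not state a server price, so the last function turns the question around and reports the largest capex that still pays back within a given number of days. It ignores power, cooling and staff on the on-prem side, which is a simplification. The function names are illustrative.

```python
def calls_per_hour(interval_ms: float) -> float:
    """Number of calls in one hour at one call every interval_ms."""
    return 3_600_000.0 / interval_ms


def daily_cloud_cost(robots: int, hours_per_day: float,
                     interval_ms: float, usd_per_call: float) -> float:
    """Daily spend in USD for a fleet calling a hosted endpoint."""
    calls = robots * hours_per_day * calls_per_hour(interval_ms)
    return calls * usd_per_call


def break_even_capex(daily_cost_usd: float, payback_days: float) -> float:
    """Largest server price (USD) that cloud spend would match in payback_days.

    Simplification: on-prem running costs are not subtracted.
    """
    return daily_cost_usd * payback_days


if __name__ == "__main__":
    cost = daily_cloud_cost(robots=3, hours_per_day=8.0,
                            interval_ms=100.0, usd_per_call=0.005)
    print(f"daily cloud cost: ${cost:,.0f}")
    print(f"capex that pays back in 90 days: ${break_even_capex(cost, 90.0):,.0f}")
```

With three robots, eight hours a day, one call every 100 ms and $0.005 per call, `daily_cloud_cost` returns about 4,320, in line with the daily figure quoted in point 3 once rounding is allowed for. On those inputs, any server costing less than roughly $389k pays back inside 90 days.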

The dedicated-edge-server case

A dedicated on-prem inference server sits in the LAN, one or two hops from the robot. For the Kentino K-AI line that is a 4U rack server, EPYC or Xeon host, 4× or 8× GPU, on a 10 GbE switch. The numbers it brings to the table:

Tier comparison — on-board vs. cloud API vs. on-prem K-AI server

Property	On-board (Jetson)	Cloud API	On-prem K-AI server
LAN/WAN round-trip	n/a (in-process)	15–120 ms WAN	0.2–0.5 ms LAN
Largest VLM that fits (INT4)	7B–13B realistic	Provider's choice	Up to 72B+ on 8× 5090 / 8× Pro 6000
Concurrent models	1, maybe 2 small	1 per endpoint	3–6 simultaneously (VLM + LLM + memory + STT)
Sustained throughput	Throttled to 30 W	Rate-limited	Wall-power limited only
Customisation	Whatever you ship	Whatever provider hosts	Any open-weight model, any quantization, any fine-tune
Data egress	None	Every request	None (firewall the box)
Cost at sustained load	Sunk (battery)	Linear in calls	Sunk (capex + power)
Failure mode	Local	WAN-dependent	LAN-dependent (recoverable)

The two rows the on-prem option dominates are sustained throughput and model choice. Those are also the two properties that matter most for the workloads we are about to enumerate.

A representative K-AI 256 Turin Dual with 8× RTX 5090 has 256 GB of aggregate VRAM and, at the cards' rated 575 W each, can draw up to about 4.6 kW of GPU power under full load. That is enough to host, simultaneously:

Qwen2.5-VL 72B at INT4 (~45–50 GB weights + 10–20 GB KV per GPU pair, tensor-parallel across 4 GPUs)
Qwen2.5 32B text-only (planning, dialogue) on 2 GPUs
A small VLA (OpenVLA 7B or Cosmos-Reason 7B) on 1 GPU for motion intent
Scene memory / RAG store (pgvector or ChromaDB) on the host CPU
Online policy fine-tuning capacity on the remaining GPU when the robot is docked
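
The placement above can be captured as a small allocation table and sanity-checked before anything is deployed. This is a sketch only: the GPU counts come from the list, the 32 GB per GPU follows from 256 GB across eight cards, and the workload labels and the `vram_by_workload` helper are hypothetical names, not part of any K-AI tooling.

```python
TOTAL_GPUS = 8
VRAM_PER_GPU_GB = 32  # 256 GB aggregate across 8 GPUs

# GPU counts from the list above; the keys are illustrative labels.
ALLOCATION = {
    "vlm-72b-int4 (tensor-parallel)": 4,
    "llm-32b-text": 2,
    "vla-7b": 1,
    "fine-tuning (while docked)": 1,
}


def vram_by_workload(allocation: dict[str, int], total_gpus: int,
                     vram_per_gpu_gb: int) -> dict[str, int]:
    """Return VRAM in GB available to each workload, or raise if over-committed."""
    used = sum(allocation.values())
    if used > total_gpus:
        raise ValueError(f"allocation needs {used} GPUs, only {total_gpus} present")
    return {name: gpus * vram_per_gpu_gb for name, gpus in allocation.items()}


if __name__ == "__main__":
    for name, gb in vram_by_workload(ALLOCATION, TOTAL_GPUS, VRAM_PER_GPU_GB).items():
        print(f"{name}: {gb} GB")
```

The scene memory store is absent on purpose, since it runs on the host CPU and takes no GPU slot.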

VLM TTFT for Qwen2.5-VL 72B on this hardware lands in the 200–400 ms range at single-request load, ramping to 1–4 seconds under heavy concurrent load. Token streaming is 25–50 tok/s. That is enough to put a 72B VLM in the deliberative-planning tier (100 ms – 1 s) at single-robot load, and in the strategic-reasoning tier (1 s – multi-s) for a small fleet. Neither tier is a problem; the reactive and reflexive tiers stay on-board where they belong.
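
TTFT is only part of the response time: the planner usually needs the whole structured reply, not just its first token. The sketch below combines the TTFT and streaming figures above with the LAN round-trip from the comparison table to estimate how long a reply takes and how many output tokens fit in a tier budget. It treats TTFT as the delay before streaming starts, which is a simplification, and the function names are illustrative.

```python
import math


def response_time_ms(ttft_ms: float, output_tokens: int,
                     tokens_per_s: float, rtt_ms: float = 0.5) -> float:
    """Approximate time until the full reply has arrived, in milliseconds."""
    return rtt_ms + ttft_ms + 1000.0 * output_tokens / tokens_per_s


def max_tokens_within(budget_ms: float, ttft_ms: float,
                      tokens_per_s: float, rtt_ms: float = 0.5) -> int:
    """Largest reply length (in tokens) that still finishes inside budget_ms."""
    remaining_ms = budget_ms - ttft_ms - rtt_ms
    if remaining_ms <= 0:
        return 0
    return math.floor(remaining_ms * tokens_per_s / 1000.0)


if __name__ == "__main__":
    # Worst and best single-request figures against the 1 s deliberative bound.
    print(max_tokens_within(1000.0, ttft_ms=400.0, tokens_per_s=25.0))
    print(max_tokens_within(1000.0, ttft_ms=200.0, tokens_per_s=50.0))
```

On these figures a reply has to stay within roughly 14 to 39 tokens to finish inside the one-second deliberative bound, which argues for terse structured outputs in that tier and leaves longer plans to strategic reasoning.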

Capability gap: what only the dedicated edge tier serves

The on-prem server is not just "the same robot, but faster". It enables a class of workloads that the other two tiers structurally cannot host. The honest list:

1. Real-time scene-understanding feedback. A 72B VLM looking at the robot's camera every 200–500 ms, returning a structured description of the scene that the planner consumes. Cloud cannot do this at scale because of WAN jitter and cost. On-board cannot do this because the model does not fit. On-prem closes the loop in 250–500 ms total.

2. Multi-camera VLM fusion. A humanoid has 3–5 cameras (head, two wrists, two body / chest). Running a VLM across all of them simultaneously — for spatial grounding, occlusion handling, or hand-eye coordination — is 5× the inference load. Cloud rate-limits or charges per stream. On-board fits one stream at small scale. On-prem batches all five through the same VLM endpoint.

3. Long-horizon task planning with persistent scene memory. "Yesterday I left the wrench on the second shelf. Find it." This requires a VLM + LLM + vector store running together with persistent state across robot sessions. The state has to live somewhere stable, queryable, and fast. That is a database on the server, not a per-call cloud context window, and not the few gigabytes of RAM left over on the Jetson.

4. Online policy fine-tuning. The robot collects task demonstrations during the day. Overnight, while it is docked, you run LoRA fine-tunes on the day's data against a base VLA, push the updated adapter, and the robot is better tomorrow. This is a 2× to 5× memory footprint over inference. Cloud charges training and storage separately. On-prem absorbs it into the same box.

5. Multi-robot fleet coordination. Two or three robots sharing a scene memory, coordinating on tasks, watching each other's state. The cross-robot coordination layer wants sub-10 ms latency between robots. On-prem with a shared server on the LAN delivers that. Cloud cannot — every robot's update goes out to a region, comes back, hits the next robot.

6. Sim-to-real iteration. Isaac Sim running on the same GPUs that serve inference, generating synthetic training data, validating policy updates before they hit the real robot. This is a half-day per iteration on cloud (data movement alone), a 30-minute loop on-prem.

None of these are sci-fi. All of them are workloads that 2026 robotics integrators are running today. None of them work cleanly on either of the other two tiers.
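
To make the scene-memory workload (item 3) concrete, here is a toy in-memory version of the idea: observations are stored with an embedding and retrieved by cosine similarity. It is a sketch under loud assumptions. The three-dimensional embeddings are hand-written stand-ins for what an embedding model would produce, nothing is persisted to disk, and a real deployment would keep this state in a database such as the pgvector or ChromaDB stores mentioned earlier. None of the names below come from those libraries.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Observation:
    text: str
    embedding: list[float]  # assumption: produced by some embedding model
    timestamp: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SceneMemory:
    items: list[Observation] = field(default_factory=list)

    def add(self, obs: Observation) -> None:
        self.items.append(obs)

    def query(self, embedding: list[float], k: int = 1) -> list[Observation]:
        """Return the k stored observations most similar to the query."""
        ranked = sorted(self.items,
                        key=lambda o: cosine(o.embedding, embedding),
                        reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    memory = SceneMemory()
    memory.add(Observation("wrench on second shelf", [0.9, 0.1, 0.0], 1.0))
    memory.add(Observation("red cup on table", [0.0, 0.2, 0.9], 2.0))
    print(memory.query([1.0, 0.0, 0.1])[0].text)
```

The linear scan is fine for a sketch. The point of moving this to the server is that the store survives robot sessions and can be shared by every robot on the LAN.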

Why this is the bleeding-edge answer in 2026

The state-of-the-art in robot perception in 2026 is VLM-in-the-loop. The model looks at the world, reasons about it in language, emits structured plans, and the policy executes them. This was a research idea in 2023, a product demo in 2024, and is a production pattern in 2025–2026.

The trend forcing the on-prem tier is straightforward: the VLMs that work are getting bigger, not smaller. Qwen2.5-VL 7B is good. Qwen2.5-VL 72B is meaningfully better. The proprietary models the frontier labs do not publish are bigger still. The "small VLM that runs on-device" path exists and will keep existing, but it trails the frontier by 12–18 months, and the capability gap that comes with that lag is meaningful. If you want the frontier behaviour, you host the frontier model. The frontier model does not fit on a Jetson.

Cloud could keep up in theory. In practice it does not, for the four reasons above (latency, jitter, cost, sovereignty), and because the frontier labs gate the largest models behind partnership tiers a robotics startup does not have access to. Open-weight 70B-class VLMs do exist, can be hosted, and run well on commodity 8-GPU servers. That confluence — open-weight frontier-class models plus commodity multi-GPU hardware — is the reason on-prem is the right answer right now and was not in 2023.

This is not a Kentino opinion either. The on-prem inference tier is what NVIDIA's Isaac stack is built around, what the major robotics platforms ship reference architectures for, and what every serious integrator we have talked to in the last six months is provisioning. The hardware caught up to the models, the market caught up to the hardware, and 2026 is the year it all cleared.

When the on-prem path is the wrong answer

To stay honest: the dedicated edge tier is overkill in several real cases.

Teleoperated robots. The operator is the planner. The robot is a puppet with a control link. There is no VLM in the loop; what little inference happens (pose estimation, low-latency video coding) can run on the Jetson. Add cloud for any heavy lifting if needed. No GPU server needed.
Simple inspection quadrupeds. A Spot-class or Go2-class robot walking a fixed patrol path with YOLO-grade detection and a thermal camera does not need a 72B VLM. The on-board Jetson does the job. The data goes to the cloud for analysis after the patrol, not during.
Demos and one-off pilots. You need the system running for a trade show, a customer pitch, a three-week proof-of-concept. Cloud gets you there in an afternoon. On-prem capex is wrong for a workload that will be torn down in a month.
Hobby and education use. A university lab with one G1 EDU, a constrained budget, and a focus on RL training over inference. The Jetson and a single 4090 workstation can host enough to do meaningful research. The full K-AI 8× tier is the wrong shape of bill.
Pure language workloads. A robot that talks but does not see — voice-only assistant on legs. A cloud LLM API is fine. The latency budget is conversational, not closed-loop.

The pattern: if your robot is not running a real VLM in the closed perception loop, you do not need the dedicated edge tier. If it is, you do.

When the on-prem path is the right answer

The dedicated edge tier becomes the right answer when at least one of the following is true, and ideally two or more:

A real VLM is in the closed loop. Not "for occasional questions" — in the perception-to-action loop, running every 100–500 ms. This is the structural reason on-prem exists.
Sustained load. Eight hours a day, five days a week, indefinitely. Capex amortises. Cloud cost-per-call accumulates indefinitely.
Data cannot leave the building. GDPR, HIPAA, ITAR, competitive, or simple paranoia. The bytes stay on the LAN. Non-negotiable.
Multiple robots. Two or more units sharing a scene memory or coordinating tasks. The server's cost is amortised across every robot that shares it, and cross-robot latency drops to LAN round-trip times.
Customisation matters. Fine-tuned VLMs, pinned model versions, proprietary heads on open backbones, niche quantizations. The freedom is the product.
Iteration speed matters. The team is shipping policy updates weekly or faster. Sim and training on the same hardware as inference closes the loop from days to hours.

If you check three or more of those, the question is no longer whether to go on-prem but which size. The four-GPU K-AI 96 tier (RTX 4090 or 5090) is enough for one robot doing real work. The eight-GPU K-AI 256 tier (5090 or Pro 6000 Blackwell) is the right shape for a small fleet, the largest VLMs, or any deployment with a training requirement on top of inference.
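
The checklist reduces to a count, and the thresholds in the text (at least one, three or more) can be encoded directly. This is a sketch for scoping conversations rather than a procurement tool. The criterion keys and the wording of the returned strings are illustrative.

```python
CRITERIA = (
    "vlm_in_closed_loop",
    "sustained_load",
    "data_must_stay_on_site",
    "multiple_robots",
    "customisation_matters",
    "iteration_speed_matters",
)


def recommend(answers: dict[str, bool]) -> str:
    """Map checklist answers to a rough recommendation using the stated thresholds."""
    unknown = set(answers) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    score = sum(1 for name in CRITERIA if answers.get(name, False))
    if score >= 3:
        return "on-prem: pick a size"
    if score >= 1:
        return "on-prem: worth evaluating"
    return "on-board or cloud is likely enough"


if __name__ == "__main__":
    print(recommend({
        "vlm_in_closed_loop": True,
        "sustained_load": True,
        "data_must_stay_on_site": True,
    }))
```

Criteria left out of the dictionary count as "no", so a partial answer sheet still produces a result.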

What to do next

If you are scoping a deployment, the questions that determine the answer:

Is there a VLM in your closed loop? If yes, you need the edge tier. If no, skip the rest.
How many robots, peak and sustained? This sizes the GPU count. Rough rule: one VLM at frontier scale needs 4 GPUs minimum; each additional robot in the fleet adds roughly one GPU of concurrent demand.
What is the largest model on your roadmap, not just today? Buy for the 24-month horizon. 70B-class VLMs are the floor in 2026; 100B+ is likely by 2027.
Where does the data have to stay? If the answer is "in the building", on-prem is the only valid answer. If "in the EU", on-prem or a sovereign EU cloud. If "anywhere", you have options.
Training, fine-tuning, or inference-only? Training roughly triples the GPU and storage budget. Be honest with yourself about whether the team will actually do it.
Power and cooling envelope? Most labs find out the hard way that they cannot deliver 4–5 kW continuous. Plan the room before the install.
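
The rough sizing rules in the list above (four GPUs for one frontier-scale VLM, about one more per additional robot, roughly triple for training) can be turned into a first-pass estimate. The sketch applies those rules literally, so treat the output as a starting point for a conversation. The eight GPUs per server matches the larger K-AI tier, and the function names are illustrative.

```python
import math

GPUS_PER_SERVER = 8  # the eight-GPU tier described earlier


def estimate_gpus(robots: int, training: bool = False) -> int:
    """First-pass GPU count from the rough rules in the scoping list."""
    if robots < 1:
        raise ValueError("need at least one robot")
    inference_gpus = 4 + (robots - 1)  # 4 for the VLM, about 1 per extra robot
    return inference_gpus * 3 if training else inference_gpus


def servers_needed(gpus: int) -> int:
    """Number of eight-GPU servers required to hold the estimate."""
    return math.ceil(gpus / GPUS_PER_SERVER)


if __name__ == "__main__":
    for robots, training in ((1, False), (3, False), (3, True)):
        gpus = estimate_gpus(robots, training)
        print(robots, training, gpus, servers_needed(gpus))
```

A three-robot fleet with training comes out at 18 GPUs under the literal rule, which is more than one chassis. That is a useful prompt to check whether training has to run concurrently with inference or can wait until the robots are docked.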

The follow-up articles in the series go deeper on the wiring (I01 already published), the inference-server software stack (I02 next), and the reference build with parts and benchmarks (I05). The capability map sketched here is the why; those are the how.

The on-prem inference tier is not a luxury or a procurement preference. For VLM-driven robotics in 2026, it is the only tier that fits the math. Everything else is a compromise you take knowing exactly which capability you gave up.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
