Autolabeling the Environment: VLM-Driven World Models for Robots

25 May 2026

In 2023 a credible household-robot dataset required a thousand person-hours of human annotators drawing boxes around mugs and chairs. In 2026 the same dataset gets produced overnight by a stack of vision-language models running on a single 8-GPU server. The human is still in the loop, but only as a reviewer of a sampled tier, not as the primary labeler. This article is about that shift — what "autolabeling" actually means for a robot today, what the pipeline looks like, where it breaks, and why the compute footprint is the part that decides whether your team can do it at all.

This is part of the Robotics track of the Kentino Wiki. It cross-references R08 (latency argument for dedicated edge compute) and I01 (edge AI architecture with on-prem inference). A future I05 will walk through the reference build sized for exactly this workload.

What autolabeling means in robotics

The classical computer-vision pipeline assumed labels were scarce and expensive. A bounding box around a "cup" cost a human ten seconds and a few cents. A pixel-precise segmentation mask cost a minute and a dollar. A frame-by-frame mask across a thirty-second video clip cost the price of a small car.

Robotics datasets are pathological for that model. A single quadruped on a thirty-minute mapping run at 30 fps produces 54,000 frames. A humanoid teleop session over a workday produces hundreds of thousands. Each frame ideally wants:

Object bounding boxes (open-vocabulary, not just COCO's 80 classes)
Instance segmentation masks (so the policy can reason about graspable regions)
A natural-language scene description (so a VLA can be conditioned on it)
Tracked identity across frames (so "the red mug" stays the same mug)
Optional: 3D position estimates, fused with depth or LiDAR

The human-annotator economics for any of those die at the first thousand frames. "Autolabeling" is the umbrella term for using foundation models — VLMs, open-vocabulary detectors, promptable segmenters — to produce those labels at the speed of inference rather than the speed of a person clicking.
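As a rough illustration of what "each frame ideally wants" adds up to, the label types listed above can be held in one per-frame record. This is a minimal sketch; the class and field names (`ObjectLabel`, `FrameLabels`, `mask_rle`, and so on) are hypothetical and not taken from any particular dataset format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectLabel:
    track_id: int                                   # stable identity across frames
    phrase: str                                     # open-vocabulary noun phrase, e.g. "red mug"
    bbox_xyxy: tuple[float, float, float, float]    # pixels: x0, y0, x1, y1
    mask_rle: Optional[str] = None                  # encoded instance mask, if available
    position_m: Optional[tuple[float, float, float]] = None  # optional 3D position
    confidence: float = 1.0


@dataclass
class FrameLabels:
    frame_index: int
    timestamp_s: float
    caption: str = ""                               # natural-language scene description
    objects: list[ObjectLabel] = field(default_factory=list)

    def by_track(self, track_id: int) -> Optional[ObjectLabel]:
        """Look up an object in this frame by its tracked identity."""
        for obj in self.objects:
            if obj.track_id == track_id:
                return obj
        return None


def frame_count(duration_s: float, fps: float) -> int:
    """Number of frames a recording of the given length produces."""
    return round(duration_s * fps)


# The thirty-minute quadruped run at 30 fps from the text.
print(frame_count(30 * 60, 30))  # 54000
```

Keeping the track identity on the object rather than on the frame is what lets "the red mug" be followed across records.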

The shift since 2023 is not philosophical, it is mechanical. Three things changed in the same eighteen-month window:

Open-vocabulary detection got usable. Grounding DINO, OWLv2, and Florence-2 went from "interesting demo" to "production-grade for ~80% of common objects" between mid-2024 and late-2025.
Promptable video segmentation arrived. SAM 2 (mid-2024) and now SAM 3 (released November 2025) made it cheap to track masks across video given a noun-phrase prompt. SAM 3 in particular accepts concept prompts directly — "yellow school bus" — and returns masks plus stable identities.
VLMs got grounded. Qwen2.5-VL (early 2025) and the follow-on Qwen3-VL families output bounding boxes on the true pixel grid in stable JSON. You can prompt a 72B VLM with "list every object in this image as JSON with bbox and a one-sentence description" and get back something you can pipe into a training loop.
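Before a VLM's JSON reply goes anywhere near a training loop it usually needs validating. The sketch below parses a reply, drops malformed entries, and clamps boxes to the image. The key names (`label`, `bbox`, `description`) are assumptions set by whatever prompt you use, not a schema defined by any specific model.

```python
import json


def parse_vlm_objects(raw: str, width: int, height: int) -> list[dict]:
    """Parse a VLM reply into clean object records, dropping malformed entries."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    cleaned = []
    for item in items:
        if not isinstance(item, dict):
            continue
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            continue
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
            continue
        # Clamp to the image bounds and normalise corner order.
        x0, x1 = sorted((min(max(bbox[0], 0), width), min(max(bbox[2], 0), width)))
        y0, y1 = sorted((min(max(bbox[1], 0), height), min(max(bbox[3], 0), height)))
        if x1 - x0 < 1 or y1 - y0 < 1:
            continue  # degenerate box after clamping
        cleaned.append({
            "label": str(item.get("label", "unknown")),
            "description": str(item.get("description", "")),
            "bbox": [x0, y0, x1, y1],
        })
    return cleaned


reply = '[{"label": "mug", "bbox": [10, 20, 110, 220], "description": "a red mug"}]'
print(parse_vlm_objects(reply, width=640, height=480))
```

Returning an empty list on unparseable output, rather than raising, lets a batch job count failures per frame and carry on.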

The 2026 state of the art is not one model — it is a composed pipeline.

Pipeline architecture

The reference autolabeling stack looks like this:

Robot record

RGB + depth + IMU + joint states, 10–30 fps
Stored to local NVMe, then synced to server

↓

Stage 1 — Open-vocabulary grounding

Grounding DINO | OWLv2 | Florence-2
in: frame + caption vocabulary (or VLM-generated free-form caption)
out: bounding boxes + class labels per frame

↓

Stage 2 — Promptable segmentation + tracking

SAM 2 or SAM 3 with Stage 1 boxes as prompts
out: per-instance masks, tracked identity across the clip

↓

Stage 3 — Scene description + relations

Qwen2.5-VL 72B | Cosmos Reason 2
in: frame + boxes/masks from stages 1+2
out: per-frame caption, per-object captions, inter-object relations ("mug ON table")

↓

Stage 4 — World-model accumulation

ConceptGraphs-style 3D scene graph
Project labels into 3D via depth + camera pose
Dedupe across views, build object instance store

↓

Stage 5 — Human review tier (sampled)

1–5% of frames pulled by uncertainty score
Reviewer corrects in Roboflow / Labelbox / V7
Corrections feed back as training signal

↓

Stage 6 — Policy training / conditioning

Fine-tune the VLA (OpenVLA-class, OFT recipe)
or condition a manipulation policy on the labeled trajectories

Six-stage autolabeling pipeline — record → ground → segment → describe → accumulate → review → train
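The composition itself is simple to express in code: each stage reads the record produced so far and adds its own outputs. The following is only a structural sketch. The three stage functions are placeholders that return canned values, standing in for real model calls, and every name here is hypothetical.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def run_pipeline(frame: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order; each one sees everything produced before it."""
    record = dict(frame)
    for name, stage in stages:
        record.update(stage(record))
        record.setdefault("provenance", []).append(name)  # which stages touched this frame
    return record


# Placeholder stages: a real pipeline would call a detector, a segmenter and a VLM here.
def ground(record: dict) -> dict:
    return {"boxes": [{"label": "mug", "bbox": [10, 20, 110, 220]}]}


def segment(record: dict) -> dict:
    return {"masks": [{"track_id": i, "box": b} for i, b in enumerate(record["boxes"])]}


def describe(record: dict) -> dict:
    return {"caption": f"{len(record['masks'])} object(s) in view"}


labeled = run_pipeline(
    {"frame_index": 0},
    [("ground", ground), ("segment", segment), ("describe", describe)],
)
print(labeled["caption"], labeled["provenance"])
```

Recording provenance per frame makes it easier to trace a bad label back to the stage that produced it during review.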

A few things are worth calling out before we move on.

First, stages 1 and 2 are often collapsed into Grounded-SAM 2, the open pipeline from IDEA-Research that wires Grounding DINO (or Florence-2 or DINO-X) into SAM 2 in one shot. The autolabel script in that repository is the canonical "boxes and masks from a noun phrase" implementation. With SAM 3's concept-prompt interface this collapses further — you give it the words, you get back tracked masks.

Second, stage 3 is the expensive one and the one where model choice matters most. A small VLM (Qwen2.5-VL 7B, or the sub-1B Florence-2 large) will produce coherent captions cheaply but miss subtleties. A 72B-class model produces dramatically richer descriptions, gets relations right more often, and is far more useful for downstream VLA training, at roughly 10× the per-frame cost of the 7B.

Third, stage 4 is what people mean when they say "world model" in this context. It is not a generative video model like Cosmos Predict. It is a persistent, 3D-aware store of "what objects exist in this room, where they are, how they relate." ConceptGraphs is the canonical open-source recipe; OK-Robot demonstrated it scales to ~170 pick-and-place tasks across ten homes. The world model is what makes the labels reusable: when the robot comes back tomorrow, it does not start from scratch.
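To make the "persistent store" idea concrete, here is a deliberately simplified object instance store. It merges a new observation into an existing node when the label matches and the 3D positions are close. This is not the ConceptGraphs algorithm: the exact-label match and the `merge_radius_m` threshold are simplifying assumptions standing in for a proper similarity test.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    node_id: int
    label: str
    position: tuple[float, float, float]    # running mean in metres, world frame
    observations: int = 1


@dataclass
class SceneStore:
    merge_radius_m: float = 0.25            # assumed dedupe threshold
    nodes: list[ObjectNode] = field(default_factory=list)

    def add_observation(self, label: str, position: tuple[float, float, float]) -> ObjectNode:
        """Merge into a nearby node with the same label, or create a new node."""
        for node in self.nodes:
            if node.label == label and math.dist(node.position, position) <= self.merge_radius_m:
                n = node.observations
                # Incremental mean keeps the stored position stable across views.
                node.position = tuple(
                    (p * n + q) / (n + 1) for p, q in zip(node.position, position)
                )
                node.observations = n + 1
                return node
        node = ObjectNode(len(self.nodes), label, tuple(position))
        self.nodes.append(node)
        return node


store = SceneStore()
store.add_observation("mug", (1.0, 0.0, 0.8))
store.add_observation("mug", (1.1, 0.0, 0.8))   # same mug seen from another view
store.add_observation("mug", (3.0, 0.0, 0.8))   # a different mug across the room
print(len(store.nodes))  # 2
```

Because node IDs persist between sessions, a later visit only has to confirm or update existing nodes instead of rebuilding the map.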

What VLMs do well, and where they fail

Honest table, because the marketing material on every one of these models is misleading in different directions:

VLM stack quality — 2026 assessment by task type

Task	VLM stack quality (2026)
Common-object detection (kitchen, office)	Excellent — 90%+ recall, low hallucination
Open-vocabulary novel categories	Good but uneven — depends on phrasing
Pixel-precise segmentation given a good box	Excellent — SAM 2/3 is essentially solved
Tracking identity across a 30 s clip	Good with SAM 3, mediocre with SAM 2 alone
Counting (how many cups on the table)	Poor — VLMs hallucinate counts persistently
Small / distant objects	Poor — boxes drop below ~20 px reliably
Fast motion (gripper, swung arm, dropped item)	Poor — motion blur kills both detection and seg
Lighting extremes (glare, low-light, IR)	Poor — training distribution doesn't cover this
Repeated identical objects (stacked boxes)	Poor — identity tracking gets confused
Novel categories from a niche industrial domain	Bad — open-vocab is "open" within ImageNet land
Free-form scene description (a paragraph)	Excellent — 72B VLMs are genuinely good here
Spatial relations (on, under, behind)	Good — Qwen2.5-VL handles this reliably

The single most important honest call: autolabels are noisy. Across the literature in 2025–2026, open-vocabulary detection on out-of-distribution domains lands at 5–15% hallucination depending on how you measure it. The GroundCount paper from earlier in 2026 reported a 6.6 percentage point improvement on counting accuracy just by adding explicit detector grounding to a VLM — which means VLMs alone are still substantially wrong on counts. None of this is a deal-breaker, but it means a pure unreviewed autolabel pipeline is not safe for safety-critical training data.

The mitigation that actually works in practice is the two-tier sampling review: you autolabel everything, then pull 1–5% of frames for human review based on an uncertainty signal (VLM token entropy, detector confidence, multi-model disagreement). The reviewers correct, and those corrections get used either as direct training data or as feedback to recalibrate the autolabeler's confidence thresholds. This is the same loop that Florence-2 itself was trained on — Microsoft's FLD-5B dataset was built by cascading specialized models and then sampling for review.
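A minimal sketch of the sampling step, assuming each frame already carries the three signals named above. The equal weighting and the `entropy_scale` normaliser are illustrative choices, not values from any paper; in practice they would be tuned against reviewer corrections.

```python
def uncertainty(detector_conf: float, token_entropy: float, models_disagree: bool,
                entropy_scale: float = 2.0) -> float:
    """Combine three signals into a score in [0, 1]; higher means less trustworthy."""
    low_conf = 1.0 - detector_conf
    entropy = min(token_entropy / entropy_scale, 1.0)
    disagreement = 1.0 if models_disagree else 0.0
    return (low_conf + entropy + disagreement) / 3.0


def select_for_review(scores: dict[int, float], fraction: float) -> list[int]:
    """Return the frame ids of the most uncertain `fraction` of frames."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    budget = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda frame_id: scores[frame_id], reverse=True)
    return ranked[:budget]


# Toy data: 100 frames whose uncertainty rises with the frame id.
scores = {frame_id: frame_id / 100 for frame_id in range(100)}
print(select_for_review(scores, fraction=0.05))  # [99, 98, 97, 96, 95]
```

Ranking and taking a fixed fraction keeps the reviewer workload predictable, whereas a fixed score threshold lets it swing with the difficulty of the day's footage.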

Compute footprint — why this lands on-prem

This is the part that surprises people who have not run the numbers.

Take a representative target: one hour of robot footage at 10 fps from a stereo camera at 1080p. That is 36,000 frames. You want all four label types: boxes, masks, captions, tracked identity.

Rough per-frame cost on a single RTX 5090 (32 GB, Blackwell, ~104 TFLOPS FP16):

Per-stage compute — 36 000 frames on a single RTX 5090

Stage	Per frame	36 000 frames
Grounding DINO (Tiny)	~30 ms	~18 min
SAM 2 large, mask + propagation	~25 ms	~15 min
Qwen2.5-VL 7B caption	~250 ms	~2.5 h
Qwen2.5-VL 72B caption (INT4, batch)	~1.5–3 s	~15–30 h
Florence-2 large (caption only)	~80 ms	~48 min

These numbers are order-of-magnitude — they assume reasonable batching, vLLM serving, and FP16/INT4 quantization where appropriate. SAM 2 alone runs at ~44 fps on an A100 in the original benchmark, so ~50–60 fps on a 5090 is realistic.
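The right-hand column of the table is just per-frame latency multiplied out, which is worth having as a function so the estimate can be redone for your own frame counts. The latencies below are copied from the table; everything else is plain arithmetic.

```python
FRAMES = 36_000  # one hour at 10 fps

# (low, high) per-frame latency in milliseconds, copied from the table above.
PER_FRAME_MS = {
    "Grounding DINO (Tiny)": (30, 30),
    "SAM 2 large, mask + propagation": (25, 25),
    "Qwen2.5-VL 7B caption": (250, 250),
    "Qwen2.5-VL 72B caption (INT4, batch)": (1500, 3000),
    "Florence-2 large (caption only)": (80, 80),
}


def stage_hours(ms_per_frame: float, frames: int = FRAMES) -> float:
    """Single-GPU hours to push `frames` frames through one stage."""
    return ms_per_frame * frames / 1000.0 / 3600.0


def format_duration(hours: float) -> str:
    return f"{hours * 60:.0f} min" if hours < 1 else f"{hours:.1f} h"


for stage, (low, high) in PER_FRAME_MS.items():
    low_h, high_h = stage_hours(low), stage_hours(high)
    span = format_duration(low_h) if low == high else f"{format_duration(low_h)} to {format_duration(high_h)}"
    print(f"{stage}: {span}")
```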

The interesting line is the 72B VLM. If you want rich scene descriptions for every frame from a 72B-class VLM, you cannot do it on a single GPU in real-time. You either:

Subsample heavily — caption every 10th frame, interpolate the rest. This is what most production pipelines actually do.
Use a smaller VLM (7B–11B class) for per-frame and reserve the 72B for keyframes only.
Throw more GPUs at it — at which point eight 5090s in one chassis becomes the bottom of the practical range.

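The aggregate cost for a full autolabel pass on one hour of 10 fps footage with the 72B in the loop lands at roughly 4–8 GPU-hours on consumer Blackwell silicon. That figure assumes the subsampled setup above (the 72B on roughly every tenth frame, a 7B on the rest), since captioning every frame with the 72B alone costs 15–30 GPU-hours by the table. The 8× 5090 K-AI 256 chassis can finish the subsampled pass in about an hour or less wall-clock with parallelism across GPUs.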
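One way to arrive at a figure in that range is sketched below. It assumes Grounding DINO, SAM 2 and the 7B captioner run on every frame, and the 72B only on every tenth frame; the stride and the perfect multi-GPU scaling are assumptions, and real pipelines will lose some efficiency to I/O and scheduling.

```python
import math


def pass_gpu_hours(frames: int, dense_ms_per_frame: float,
                   keyframe_ms: float, keyframe_stride: int) -> float:
    """GPU-hours for one pass: dense stages on every frame, a heavy stage on keyframes."""
    keyframes = math.ceil(frames / keyframe_stride)
    total_ms = frames * dense_ms_per_frame + keyframes * keyframe_ms
    return total_ms / 1000.0 / 3600.0


def wall_clock_hours(gpu_hours: float, n_gpus: int, scaling_efficiency: float = 1.0) -> float:
    """Wall-clock time when the work is spread across several GPUs."""
    return gpu_hours / (n_gpus * scaling_efficiency)


DENSE_MS = 30 + 25 + 250   # Grounding DINO + SAM 2 + 7B caption, from the table

low = pass_gpu_hours(36_000, DENSE_MS, keyframe_ms=1500, keyframe_stride=10)
high = pass_gpu_hours(36_000, DENSE_MS, keyframe_ms=3000, keyframe_stride=10)
print(f"{low:.2f} to {high:.2f} GPU-hours")                      # 4.55 to 6.05 GPU-hours
print(f"{wall_clock_hours(high, n_gpus=8):.2f} h on 8 GPUs")     # 0.76 h on 8 GPUs
```

Setting `keyframe_stride=1` reproduces the every-frame case and shows why it is out of reach for a nightly run on one GPU.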

Now the cloud math. The same workload on a hyperscaler:

Compute: comparable, perhaps cheaper at spot pricing.
Data egress: brutal. A 1080p stereo recording at 10 fps for an hour is ~30–80 GB raw, more if you keep depth. Storing it in cloud and pulling labels back out costs cents on the way in and tens of dollars on the way out per pass. The Robo-DM paper from Berkeley in 2025 measured this explicitly: storing 8.9 TB of Open-X data on Google Cloud costs $172/month, but every full download costs $172–$1,540 depending on tier. Scale that across a fleet that records hundreds of hours per week and the egress alone exceeds the capex amortization of a single on-prem server within a year.
Latency on the loop: long. The point of autolabeling is the closed loop — record today, label tonight, fine-tune tomorrow, push improved policy by the morning. A cloud round-trip adds hours of upload time on a typical lab uplink.
Privacy: a problem. The same regulated-data argument from R08 applies here. Raw robot video from a patient room, a factory floor, or a defense lab does not go to anyone else's GPU.

This is why every serious robotics lab in 2026 owns its autolabeling compute. The K-AI 256 Turin Dual with 8× RTX 5090 is sized almost exactly for this workload — 256 GB system RAM, eight GPUs for parallel pipeline stages, NVMe for the dataset hot tier. The 4× RTX Pro 6000 Blackwell configuration is the upgrade path when the team wants to run the 72B in FP16 instead of INT4 and keep more concurrent models resident.
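The upload-latency point is easy to check for your own site. The sketch below estimates transfer time from recording size and link speed; the 100 Mbit/s uplink and the 0.8 efficiency factor are assumed example values, not measurements.

```python
def transfer_hours(size_gb: float, link_mbit_s: float, efficiency: float = 0.8) -> float:
    """Hours to move `size_gb` (decimal gigabytes) over a link of `link_mbit_s`."""
    if link_mbit_s <= 0 or not 0.0 < efficiency <= 1.0:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_gb * 8e9
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / 3600.0


# One hour of raw stereo footage (upper end of the 30-80 GB range) on a 100 Mbit/s uplink.
print(f"{transfer_hours(80, 100):.1f} h")   # 2.2 h

# The same recording over an in-building 10 GbE link.
print(f"{transfer_hours(80, 10_000) * 60:.1f} min")
```

At fleet scale the upload is repeated for every robot-hour recorded, which is where the nightly loop stops fitting into a night.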

The closed loop

The reason the on-prem footprint pays back is not the autolabeling itself — it is the loop it enables.

Day N evening

Robot fleet returns from deployment, syncs ~6 h of footage

↓

Day N night

Server autolabels overnight (4–8 GPU-h per robot-hour)

↓

Day N+1 morning

Reviewer team handles the 1–5% flagged tier

↓

Day N+1 afternoon

LoRA / OFT fine-tune of the VLA policy

↓

Day N+1 evening

New weights packaged, validated in sim

↓

Day N+2 morning

Push to fleet, robots deploy with updated policy

Daily closed loop — record → autolabel → review → fine-tune → validate → deploy

This is the loop that the OpenVLA-OFT recipe (March 2025) was designed for: its authors report 25–50× faster action generation than vanilla OpenVLA, in a recipe designed to fit on a single workstation-class GPU server. FLaRe (ICRA 2025) is the reinforcement-learning analogue. The continual-learning work on adapter-based fine-tuning (OMLA, LifeLong-RFT) lets you adapt without catastrophic forgetting.

None of this works at cloud round-trip cadence. The loop is the value, and the loop requires the data and the compute to be in the same building.

A concrete example — household humanoid

To make this concrete, imagine the simplest viable autolabel pipeline for a humanoid running household tasks (load dishwasher, fold laundry, fetch items from a labeled bin).

Recording: the humanoid has stereo RGB cameras at 30 fps, wrist cameras at 15 fps, depth from active stereo, joint states at 200 Hz. A two-hour session produces ~250 GB raw on the on-board NVMe.

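Sync: at end-of-session the robot uploads to the lab's K-AI server over a wired link or Wi-Fi 6E. Moving 250 GB in ~5–10 minutes takes roughly 3–7 Gbit/s sustained, which in practice means 10 GbE-class wiring; over Wi-Fi 6E, budget considerably longer.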

Stage 1+2 (Grounded-SAM 2): open-vocabulary detection with a domain vocabulary of about 200 household nouns ("mug", "spatula", "laundry basket", "blue dish-towel"…) plus the agent's own end-effectors. SAM 2 propagates masks through clips. Wall-clock on 8× 5090: ~45 minutes.

Stage 3 (Qwen2.5-VL): 7B VLM at every frame for a brief caption, 72B at every tenth frame for a richer description plus inter-object relations. Wall-clock: ~3 hours.

Stage 4 (scene graph): ConceptGraphs-style accumulator builds a persistent 3D scene graph of the apartment. By the end of the week, every object the robot has seen lives in the graph with a stable ID, language descriptors, and a coarse 3D position. Wall-clock: a few minutes per session, amortized.
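The "coarse 3D position" comes from lifting a labeled pixel into the world frame using depth and camera pose. A minimal sketch with the standard pinhole model follows; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the pose are made-up example values, and lens distortion is ignored.

```python
Vec3 = tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pixel (u, v) with metric depth -> 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def camera_to_world(p_cam: Vec3, rotation: list[list[float]], translation: Vec3) -> Vec3:
    """Apply the camera pose: p_world = R @ p_cam + t, with R as a 3x3 row-major matrix."""
    return tuple(
        sum(rotation[r][c] * p_cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


# Example: centre of a mug's bounding box, 2 m away, camera at (1, 2, 3) with no rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(u=920, v=240, depth_m=2.0, fx=600, fy=600, cx=320, cy=240)
print(camera_to_world(p_cam, identity, (1.0, 2.0, 3.0)))  # (3.0, 2.0, 5.0)
```

Using the median depth inside the instance mask, rather than the depth at the box centre, is usually more robust to holes in the depth image.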

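Stage 5 (review): an internal tool surfaces frames where the VLM's class confidence < 0.6, or where Stage 1 and Stage 3 disagree on a class. A reviewer handles ~500 frames per hour, so the sample has to be sized against that throughput. If the session is labeled at 10 fps, two hours is 72,000 frames: a 1% sample is 720 frames, roughly an hour and a half of human time per day, while a 5% sample is 3,600 frames, about seven hours.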
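The flagging rule and the reviewer budget are both small enough to write down. In this sketch the 0.6 threshold and the 500 frames-per-hour throughput come from the text; the function names and the case-insensitive label comparison are assumptions.

```python
def needs_review(vlm_confidence: float, detector_label: str, vlm_label: str,
                 threshold: float = 0.6) -> bool:
    """Flag a detection if the VLM is unsure or the two stages name different classes."""
    disagree = detector_label.strip().lower() != vlm_label.strip().lower()
    return vlm_confidence < threshold or disagree


def reviewer_hours(n_frames: int, sample_rate: float, frames_per_hour: float = 500) -> float:
    """Human hours needed to review `sample_rate` of `n_frames`."""
    return n_frames * sample_rate / frames_per_hour


print(needs_review(0.9, "mug", "Mug"))    # False: confident and in agreement
print(needs_review(0.9, "mug", "cup"))    # True: the stages disagree
print(needs_review(0.5, "mug", "mug"))    # True: below the confidence threshold

# Frame count, sample rate and throughput are the three knobs that set the daily load.
for rate in (0.01, 0.05):
    print(rate, round(reviewer_hours(72_000, rate), 2))
```

Exact string comparison will over-flag synonyms such as "cup" and "mug"; mapping both stages onto a shared vocabulary first would reduce that noise.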

Stage 6 (training): the corrected labels feed an OFT-style fine-tune of the VLA. The K-AI server runs this overnight on the same hardware that did the autolabeling — the workloads are sequenced, not concurrent.

This is not a research thought experiment. This is what 1X, Skild AI, and the published groups using OpenVLA actually do in 2026, modulo internal variations. The pipeline is open, the models are open, the bottleneck is compute and engineering effort — not access to the algorithms.

Honest limits

Four things that this article should not let pass without acknowledgment:

Hallucination is real and persistent. Even with the two-tier review, you cannot trust unreviewed autolabels for safety-critical training (collision avoidance, contact decisions, anything where a wrong label could harm the robot or a person). Use them for capability training, not safety training. For safety, you still want curated data.

Out-of-distribution grounding degrades fast. A VLM trained primarily on web images will be excellent in kitchens and offices and noticeably worse on a CNC shop floor or a hospital ward. The fix is domain-specific fine-tuning of the autolabeler itself, which has its own cost.

The world model is brittle to environment change. ConceptGraphs and friends assume the world is roughly static between visits. Move the furniture, and the scene graph needs to be rebuilt or aggressively re-validated. There is active work on this (online open-vocabulary scene graphs, the 2025 Naver Labs paper among others), but treat the world model as advisory, not authoritative.

Compute estimates here are rough. All the per-frame numbers depend on batching strategy, quantization, prompt length, and image resolution. Treat the table as order-of-magnitude. The order-of-magnitude is what matters for sizing the box.

What to do next

If you are evaluating whether to stand up an autolabeling stack:

Decide what you actually need labeled. Boxes and masks alone — Grounded-SAM 2 on a single GPU is enough. Captions and relations — you need a 7B–11B VLM minimum. Rich descriptions for VLA training — you need 72B-class, and you need to budget the GPU-hours honestly.
Audit your domain. Are the objects you care about in the open-vocabulary detectors' training distribution? If you are mostly working in kitchens, offices, or warehouses — yes. Industrial or medical specialty domains — plan for fine-tuning the autolabeler before you trust it.
Plan the review tier from day one. Pick a tool (Roboflow, Labelbox, V7, or a homegrown one with uncertainty-based sampling) and budget at least one reviewer-FTE per ten robot-hours-per-day of recording. The autolabel pipeline does not replace humans, it changes what humans do.
Size the compute for the 72B step. The other stages fit on anything. The 72B VLM at scale is the line item that justifies the 8-GPU server. If your pipeline only ever uses 7B-class VLMs, a 4-GPU box is enough. If you want the richer descriptions and the closed-loop fine-tune cadence, you want the 8-GPU configuration.
Put the hot tier on NVMe and the cold tier on spinning disk. A week of fleet recording is terabytes. The autolabeler is constrained more often by I/O than by GPU compute when you are using the smaller models.

The Kentino lineup has the K-AI 256 Turin Dual / 8× RTX 5090 sized for this workload at the consumer-silicon end, and the K-AI 4× RTX Pro 6000 Blackwell at the higher-VRAM end when you want to keep multiple large VLMs resident concurrently. Pricing and build details are in the relevant product pages and in a future I05 article that walks through the full reference build.

The bleeding edge of this stack is moving every quarter — SAM 3 is six months old, Qwen3-VL just shipped, Cosmos Reason 2 is fresh — so the specific models in this article will date faster than the architecture. The architecture itself is now stable. Boxes, masks, captions, scene graph, review, train, deploy. That loop is not going anywhere.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
