The Robot Sensor Stack

15 June 2026

A modern legged robot ships with somewhere between fifteen and forty distinct sensor channels, depending on how you count. The marketing photo shows two — the LiDAR turret on top and the depth camera on the head. The other thirty-eight are inside the chest, in the joints, on the feet, and (increasingly) in the fingers. They are what makes the difference between "the robot stood up" and "the robot stood up, knew where it was, and decided what to do next."

This article walks through what is actually on a 2026 humanoid or quadruped, what each sensor is for, what the credible options are at each tier, and the honest take on which ones are oversold. The previous articles in this series — R01 and R02 — listed sensors per platform. This one is the reference.

Two worlds: proprioceptive and exteroceptive

Every robot sensor falls into one of two categories, and the split matters because the two worlds have very different latency, accuracy, and failure-mode budgets.

Proprioceptive sensors tell the robot about itself: where its joints are, how fast they are moving, which way is down, whether a foot is on the ground. These run at high rates (200 Hz to 2 kHz), they are deterministic, they cannot tolerate jitter, and they live on the real-time motor-control bus. If any one of them lies, the robot falls over within 50 ms.

Exteroceptive sensors tell the robot about the world: cameras, depth sensors, LiDAR, microphones, tactile pads. These run slower (10–60 Hz), they are statistical (every depth pixel has noise, every LiDAR return has reflectivity scaling), and they live on the perception bus — Ethernet or USB into the application processor. If one fails, the robot stops moving but does not fall.

A useful mental model: proprioception is what keeps the robot upright. Exteroception is what makes it useful. You cannot trade one for the other.

Category	Examples	Rate	Failure mode
Proprioceptive	IMU, joint encoders, foot contact, motor current	200 Hz–2 kHz	Robot falls
Exteroceptive	RGB, depth, LiDAR, mic, tactile	10–60 Hz	Robot stops, requests help

The IMU — the sensor everything else hangs off

The Inertial Measurement Unit is the single most important sensor on a legged robot. It provides three-axis acceleration and three-axis angular velocity (a 6-DoF IMU) or those plus magnetometer (9-DoF). The control loop reads it at 1 kHz or higher and uses it to estimate trunk orientation, derive balance error, and predict the next half-step.

A 2026 humanoid typically carries two IMUs (one in the trunk, one in the head), and the more capable platforms add one per foot. Quadrupeds usually run one in the body and derive foot-state from joint torque rather than per-foot inertial.

IMU	Class	Rate	Gyro bias stability	Typical use
Bosch BMI088	Consumer high-perf	up to 2 kHz	~2 °/h	Trunk on most humanoids/quadrupeds
TDK ICM-42688	Consumer high-perf	up to 8 kHz	~3 °/h	Trunk, low-cost platforms
VectorNav VN-100	Industrial MEMS	800 Hz	~5–10 °/h	Inspection-grade outdoor
VectorNav VN-200/VN-300	GNSS-aided	400 Hz	tactical	Outdoor, mapping units
Tactical-grade FOG	Fibre-optic	1–4 kHz	~0.01 °/h	Not in legged-robot price tier

The BMI088 is genuinely the workhorse. It was designed for drones and robotics, the gyro measures rotation rates up to ±2000 °/s, and it survives the vibration spectrum a walking humanoid produces. The ICM-42688 from TDK is the alternative that shows up in cost-driven builds. VectorNav is the step up when you actually need outdoor heading-grade performance, which most indoor humanoids do not.

The IMU is also what enables ZUPT (zero-velocity update) — the trick where, during the stance phase of a step, the foot is known to be stationary, so you can use that interval to reset the velocity estimate in the SLAM pipeline. Without ZUPT, dead-reckoning drift kills VIO and LIO over minutes. With it, drift drops by an order of magnitude. A robot without a per-foot or per-leg IMU loses this option, which is one reason high-end research humanoids put IMUs in the feet.
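The ZUPT idea can be sketched in one dimension. This is a minimal illustration, not a SLAM implementation: the `in_stance` flag is a hypothetical signal from the gait state machine, the accelerometer bias is an invented number, and real pipelines usually apply ZUPT as a filter measurement update rather than the hard reset shown here.

```python
def integrate_velocity(accels, in_stance, dt, use_zupt=True):
    """Integrate acceleration into velocity, optionally zeroing it in stance."""
    v = 0.0
    history = []
    for a, stance in zip(accels, in_stance):
        v += a * dt
        if use_zupt and stance:
            v = 0.0  # foot is known to be stationary: discard accumulated error
        history.append(v)
    return history


# Hypothetical data: true acceleration is zero, the sensor reads a 0.05 m/s^2 bias.
dt = 0.001                       # 1 kHz IMU
n = 10_000                       # 10 s of samples
accels = [0.05] * n
# Assume stance occupies the first half of every 1 s step cycle.
in_stance = [(i % 1000) < 500 for i in range(n)]

err_without = integrate_velocity(accels, in_stance, dt, use_zupt=False)[-1]
err_with = integrate_velocity(accels, in_stance, dt, use_zupt=True)[-1]
print(f"velocity error without ZUPT: {err_without:.3f} m/s")
print(f"velocity error with ZUPT:    {err_with:.3f} m/s")
```

With the reset, the error can only accumulate over one swing phase instead of over the whole run, which is the mechanism behind the drift reduction described above.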

The trap to avoid: sample rate marketing. A 2 kHz IMU at 5 °/h bias is much worse than a 400 Hz IMU at 0.5 °/h bias for any task longer than ten seconds. Read the bias stability number, not the headline rate.
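A rough way to see why bias stability dominates: treat the quoted figure as a constant, uncompensated gyro bias and integrate it. That is a simplification (real error budgets also include angle random walk and temperature effects), but it shows that the sample rate never enters the drift term.

```python
def heading_drift_deg(bias_deg_per_hour, seconds):
    """Heading error from integrating a constant, uncompensated gyro bias."""
    return bias_deg_per_hour * seconds / 3600.0


# The two hypothetical IMUs from the paragraph above. The sample rate is only
# part of the label: it does not appear in the drift formula at all.
imus = [("2 kHz, 5 deg/h", 5.0), ("400 Hz, 0.5 deg/h", 0.5)]

for label, bias in imus:
    for seconds in (10, 60, 600):
        drift = heading_drift_deg(bias, seconds)
        print(f"{label:>18} after {seconds:>3} s: {drift:.4f} deg")
```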

Joint encoders — the sense the spec sheet hides

Every actuated joint has at least one encoder. Most production humanoids and quadrupeds use two per joint — an absolute encoder on the output shaft (for safety and startup), and an incremental encoder on the motor (for high-resolution velocity feedback). This dual-encoder arrangement is the difference between a research prototype and something you can leave running overnight.

The two dominant families:

Magnetic absolute encoders. AS5048 and successors (ams-osram), MT6816, AS5047. 14-bit resolution = 16,384 counts per revolution = 0.022° native resolution. Accuracy is coarser than resolution: after microcontroller linearisation, ~0.05° is achievable. Contactless, no mechanical wear, survives shock and vibration. The standard.
Optical encoders. Higher resolution at the high end (20-bit, ~0.0003°), better in clean industrial environments. But they have a glass disc, they need alignment, and they do not love a robot leg's vibration spectrum. You see them in surgical and precision lab robotics, not in walking platforms.

The trade-off is straightforward: magnetic encoders are the right answer for legs and most arms; optical encoders show up in wrist and finger joints of high-end dexterous hands where you genuinely need that last decimal place.
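The resolution figures quoted for the two encoder families follow directly from the bit depth. The sketch below reproduces them and converts an angular error into a position error at the end of a link; the 0.4 m link length is a hypothetical value chosen only for illustration.

```python
import math


def resolution_deg(bits):
    """Angular size of one count for an encoder with the given bit depth."""
    return 360.0 / (2 ** bits)


def tip_error_mm(angle_error_deg, link_length_m):
    """Position error at the end of a rigid link for a given joint-angle error."""
    return math.radians(angle_error_deg) * link_length_m * 1000.0


for bits in (14, 20):
    print(f"{bits}-bit: {2 ** bits} counts/rev, {resolution_deg(bits):.5f} deg per count")

# Hypothetical 0.4 m link with the ~0.05 deg accuracy quoted for magnetic encoders.
print(f"tip error: {tip_error_mm(0.05, 0.4):.2f} mm")
```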

Calibration drift is the failure mode no one warns you about. Encoders are accurate but referenced to a zero point set at boot or commissioning. Over time, the mechanical relationship between encoder and joint shifts by tenths of a degree from thermal cycling, harmonic-gear wear, and impacts. Production deployments need a re-zero procedure (usually quarterly) and a monitoring threshold that alarms when joint-zero drift exceeds spec. Skipping this is how a robot that was perfect at install starts dropping objects six months in.
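A monitoring threshold of this kind can be very small. The sketch below assumes each joint can be driven to a repeatable reference pose (a fixture or hard stop) where the encoder reading is compared against the value recorded at commissioning; the `ZeroCheck` structure, the joint names, and the 0.2° limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ZeroCheck:
    joint: str
    commissioned_deg: float   # encoder reading at the reference pose at install
    measured_deg: float       # encoder reading at the same reference pose today


def joints_out_of_spec(checks, limit_deg=0.2):
    """Return (joint, drift_deg) for every joint whose zero moved past the limit."""
    flagged = []
    for c in checks:
        drift = c.measured_deg - c.commissioned_deg
        if abs(drift) > limit_deg:
            flagged.append((c.joint, round(drift, 3)))
    return flagged


checks = [
    ZeroCheck("left_knee", 12.40, 12.45),
    ZeroCheck("right_wrist_pitch", -3.10, -2.75),
]
for joint, drift in joints_out_of_spec(checks):
    print(f"ALARM {joint}: zero drifted by {drift:+.3f} deg")
```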

Depth cameras — the perception workhorse

The RGB-D head camera is the single most-used exteroceptive sensor in 2026 legged robotics. It is good enough for obstacle avoidance, terrain classification at one to three metres, grasp-point detection on a tabletop, and short-range SLAM. Three technologies compete:

Tech	How it works	Strengths	Weaknesses
Stereo (passive)	Two RGB sensors, disparity → depth	Outdoor-friendly, no projector	Needs texture; fails on plain walls
Stereo + IR projector	As above plus structured IR pattern	Works on textureless surfaces	Range falls in bright sunlight
Structured light	Projects known pattern, decodes deformation	High accuracy indoors	Very poor outdoors; short range
Time-of-flight (ToF)	Measures light travel time pixel-wise	Wide FOV, indoor accuracy	Multi-path artifacts, lower resolution

The two product families that dominate the legged-robot world:

Intel RealSense. The D435 and D435i are the legacy default — 87° × 58° FOV, 0.1–10 m range, ~2% depth error at 2 m, global shutter on the depth sensors. The D435i adds a Bosch BMI055-class IMU integrated into the camera, which simplifies the time-sync problem. The D455 doubled the baseline to 95 mm, which doubled effective range (depth error <2% at 4 m), and added the IMU as standard. The D457 is the GMSL2 / FAKRA-ruggedised D455 — same imaging, automotive-grade connector, IP65 housing, designed for robots that actually leave the lab. Almost every research humanoid in R01 ships either a D435i or D455 on the head; the D457 is the version showing up on outdoor-rated quadrupeds.

Orbbec. The Chinese alternative — Femto Bolt, Femto Mega, Gemini 335 / 336 / 2 XL. The Femto Mega is interesting because it integrates a Jetson Nano on-camera and uses Microsoft's ToF sensor (the same as in Azure Kinect DK, since Microsoft exited that market). Gemini 2 XL extends usable depth to 20 m at the long end. The Orbbec models are typically 30–50% cheaper than RealSense at equivalent specs and have become the default on Chinese-built platforms when supply-chain risk on Intel parts is a concern. For pure indoor robotics they are equivalent; for outdoor and high-vibration use the RealSense ecosystem still has the better autopilot and ROS 2 integration.
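The link between baseline and range claimed for the D455 can be checked with the standard first-order stereo error model, where depth error grows as z² / (f · b). The focal length and sub-pixel disparity error below are assumed values, not datasheet figures, and the 50 mm figure for the shorter baseline is inferred from "doubled" rather than stated in the text.

```python
def depth_error_m(z_m, baseline_m, focal_px, disparity_error_px=0.08):
    """First-order stereo depth error: dz = z^2 * dd / (f * b)."""
    return (z_m ** 2) * disparity_error_px / (focal_px * baseline_m)


FOCAL_PX = 640.0  # hypothetical focal length in pixels

for name, baseline in [("50 mm baseline", 0.050), ("95 mm baseline", 0.095)]:
    for z in (2.0, 4.0):
        err = depth_error_m(z, baseline, FOCAL_PX)
        print(f"{name} at {z:.0f} m: {100 * err / z:.2f}% of range")
```

Under this model the relative error scales as z / b, so doubling the baseline keeps the same percentage error at twice the distance.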

A trap that catches first-time integrators: multi-camera IR interference. Put two structured-light depth cameras in the same room facing each other, and both lose depth in the overlap zone. Mitigations: time-multiplex the projectors (hardware-trigger one off while the other captures), or move to pure stereo (no active projector). This is one of the reasons larger fleets gravitate to passive stereo or ToF.

RGB cameras — and why the shutter type matters more than the megapixels

The depth camera's RGB sensor is usually fine for what it is, but a serious manipulation rig adds a separate RGB camera for high-resolution colour. The non-obvious question is global shutter vs rolling shutter.

A rolling-shutter sensor reads rows of pixels sequentially. If the camera or the scene moves during the readout, the image shears — a vertical line becomes a diagonal, a spinning wheel becomes a banana. For a stationary table-top inspection, this is fine. For a walking humanoid, where the head accelerates at 5 m/s² laterally each step, rolling shutter destroys feature matching, breaks visual odometry, and produces unusable input to VLMs.

The rule is straightforward: anything on a moving robot needs global shutter. The D435i, D455, and D457 depth modules are all global shutter (the colour sensor in some variants is rolling — read the datasheet). A separate global-shutter RGB camera adds €200–€600 to a build. Skimping here is one of the most common and most damaging spec errors made by first-time robot builders.
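To put a number on the shear, a small-angle estimate is enough: during readout the image slides by roughly focal length × yaw rate × readout time. The yaw rate, readout time, and focal length below are hypothetical, and the model ignores translation and lens distortion.

```python
def rolling_shutter_shear_px(yaw_rate_rad_s, readout_time_s, focal_px):
    """Horizontal offset between the first and last row of one frame."""
    return focal_px * yaw_rate_rad_s * readout_time_s


# Hypothetical head motion and sensor: 1 rad/s yaw, 20 ms readout, 900 px focal length.
shear = rolling_shutter_shear_px(1.0, 0.020, 900.0)
print(f"rolling shutter: top and bottom rows differ by about {shear:.0f} px")

# A global shutter exposes all rows together, so the readout term is zero.
print(f"global shutter:  {rolling_shutter_shear_px(1.0, 0.0, 900.0):.0f} px")
```

A feature matcher that expects sub-pixel consistency across the frame has little chance against an offset of that size.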

LiDAR — when it earns its keep and when it does not

LiDAR is the most over-specified sensor in 2026 robotics. It is genuinely necessary on outdoor quadrupeds, multi-room SLAM platforms, and any unit that needs to see beyond 10 m reliably. It is overkill on indoor manipulation humanoids that operate in a 3 m × 3 m volume.

The two architectures:

Spinning mechanical LiDAR. Traditional design in which the optical assembly itself rotates. Higher cost, higher channel counts, mature. Ouster OS0 / OS1 (now Rev8 with native colour, 2026), Hesai XT16 / XT32 / QT128. Spinning units are wear parts but the bearings now last 30,000+ hours.
Solid-state / hybrid-solid-state LiDAR. Livox's rotating-mirror hybrid (Mid-360, HAP, Avia) and pure MEMS solid-state. No external rotating part, lower cost, sometimes non-uniform scan patterns. Mid-360 has become the default for legged robots.

Typical sensors in the legged-robot world:

LiDAR	FoV (H × V)	Range (10% refl.)	Points/s	Approx cost	Where it lands
Livox Mid-360	360° × 59°	~40 m (70 m, 80%)	200 k	~$800	Default on Unitree G1 / SE01 / B2
Livox HAP	120° × 25°	150 m	452 k	~$1.4k	Forward-facing on larger units
Livox Avia	70° × 77°	190 m (450 m max)	240 k	~$1.4k	Niche; high-density forward
Hesai XT16	360° × 30°	50 m	320 k	~$3.5k	Unitree Go2 EDU+, B2
Hesai XT32	360° × 31°	120 m	640 k	~$5–6k	B2, premium dog tier
Hesai QT128	360° × 105°	50 m	1.5 M	~$6–8k	Industrial, dense overhead
Ouster OS0 (Rev8)	360° × 90°	50 m	1.3–2.6 M	~$8–12k	Spot, ANYmal payload
Ouster OS1 (Rev8)	360° × 45°	120–200 m	1.3–5.2 M	~$8–15k	Industrial mapping rigs

The Livox Mid-360 deserves its dominance — it is small (under 300 g), low-power (under 10 W), has 360° coverage adequate for SLAM, and at $800 it is cheap enough to ship on consumer-tier dogs. For a humanoid that walks indoors and needs to map a building, the Mid-360 is the answer 80% of the time.

When LiDAR genuinely matters:

Outdoor or multi-room SLAM, where depth-camera range (5–8 m) is not enough.
Long-corridor navigation, where you need to see the end of the hallway from the start.
Bright sunlight, where IR-projector depth cameras fail and only LiDAR or passive stereo works.
Reflective / textureless environments (warehouses with painted floors, glass walls) where stereo loses features.

When LiDAR is oversold:

Indoor humanoid manipulation in a workspace under 5 m. A pair of D455s does the job for a fraction of the cost and weight of a spinning LiDAR.
"Marketing dog" demos that never leave a paved area. The LiDAR is there to look serious.
Cluttered rooms with abundant texture. Stereo is genuinely good enough, and the LiDAR is doing duplicate work.

If you can answer "no" to all four "when it matters" questions above, you do not need LiDAR. Save the $800–$15,000 and put it into a global-shutter RGB camera and a better IMU.
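The four "when it matters" questions reduce to a simple checklist. The function and argument names below are hypothetical; the logic is only the rule stated above, expressed as code.

```python
def needs_lidar(outdoor_or_multiroom_slam, long_corridors,
                bright_sunlight, reflective_or_textureless):
    """LiDAR is justified if any of the four 'when it matters' cases applies."""
    return any([
        outdoor_or_multiroom_slam,
        long_corridors,
        bright_sunlight,
        reflective_or_textureless,
    ])


# Indoor tabletop manipulation in a textured room: all four answers are "no".
print(needs_lidar(False, False, False, False))

# Warehouse patrol with glass walls and long aisles.
print(needs_lidar(True, True, False, True))
```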

Force/torque on the end-effector — the manipulation enabler

The moment a robot tries to do anything other than pick up rigid plastic blocks, force/torque sensing at the wrist becomes the bottleneck. A 6-axis F/T sensor measures three force components and three torque components at the joint between the wrist and the gripper/hand, and feeds them into the impedance controller that decides how hard to push, when to stop, and when something is jammed.

The reference units come from ATI Industrial Automation (Mini, Nano, Axia, Mini43LP). Mini43LP is the current sweet spot for humanoid wrists: under 8 mm tall, 6-axis, semiconductor strain gauges, real-time data at 8 kHz over EtherCAT or Ethernet. Cost is in the $4k–$12k range depending on configuration, which is why most sub-$50k humanoids ship without F/T sensing as standard. The Unitree H1-2 with manipulation arms supports it; the G1 base does not.

If your task is "carry a tray of cups", "open a refrigerator door without ripping the handle off", or "hand a tool to a human", you want F/T at the wrist. If your task is "pick up a known object from a known location", you can fake it with motor current feedback and skip the dedicated sensor.
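For readers who have not met an impedance controller, the core of it is a virtual spring and damper, with the measured wrist force used as a guard. The one-dimensional sketch below uses invented gains and an invented 30 N limit; it is not tied to any particular sensor or robot API.

```python
def impedance_force(x_des, x, v_des, v, stiffness=500.0, damping=40.0):
    """Commanded force from a virtual spring (N/m) and damper (N*s/m)."""
    return stiffness * (x_des - x) + damping * (v_des - v)


def is_jammed(measured_force_n, limit_n=30.0):
    """Flag a jam when the wrist F/T reading exceeds the allowed contact force."""
    return abs(measured_force_n) > limit_n


# Gripper is 2 cm short of its target and at rest.
command = impedance_force(x_des=0.10, x=0.08, v_des=0.0, v=0.0)
print(f"commanded force: {command:.1f} N")

# The F/T sensor reports 42 N along the push axis: stop pushing.
print("jammed" if is_jammed(42.0) else "ok")
```

Lower stiffness makes the arm more compliant in contact; the force guard is what separates opening a door from ripping the handle off.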

Tactile sensors — the real-but-immature frontier

Tactile sensing is the technology that 2026 humanoid demos are loud about and 2026 humanoid production deployments mostly do not use. The technology has matured enough to be useful in research; it has not matured enough to be cheap, reliable, or ship-and-forget.

The two relevant families:

GelSight / DIGIT. Camera-based tactile sensors: a soft elastomer pad with markers, lit and imaged from underneath by a tiny camera. They read surface contact geometry at sub-millimetre resolution and derive shear and normal force from marker displacement. Originally MIT research, productized by GelSight; the DIGIT variant was developed with Meta AI (Digit 360 announced late 2024, Phase II SBIR contracts active 2026). Excellent at fine geometry: texture, edges, slip detection. Cost is in the $1k–$3k range per finger, plus the camera processing load.
Capacitive and piezoresistive arrays. Older technology, lower resolution, much cheaper. Used in some grippers for crude contact-yes-no. Not really tactile sensing in the modern sense.

What tactile actually enables on a humanoid:

Slip detection. The robot feels an object slipping in the grip and increases force before it falls. This is the single most useful tactile signal in current research.
Texture-based object discrimination. Cloth vs leather vs metal at the fingertip.
Geometry reconstruction. Building a 3D model of an object the hand is exploring blind.
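Slip detection, the first item above, is often explained with a Coulomb friction model: slip becomes likely once shear force approaches the friction coefficient times normal force. The sketch assumes the tactile pipeline already produces normal and shear force estimates; the friction coefficient, safety margin, and step size are hypothetical.

```python
def grip_force_update(normal_n, shear_n, mu=0.6, margin=0.8,
                      step_n=0.5, max_n=20.0):
    """Raise the grip force when shear nears the friction limit mu * normal."""
    if shear_n > margin * mu * normal_n:
        return min(normal_n + step_n, max_n)  # tighten, but respect the ceiling
    return normal_n                           # grip is holding, leave it alone


print(grip_force_update(normal_n=5.0, shear_n=1.0))  # well inside the friction cone
print(grip_force_update(normal_n=5.0, shear_n=3.0))  # close to slipping: tighten
```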

What tactile does not yet do reliably:

Survive a dishwasher cycle worth of robot use. Soft gel pads wear, get cut, need replacement on a schedule like foot pads.
Work without a dedicated GPU. Camera-based tactile is real-time computer vision; each sensor adds 10–30% load on a Jetson AGX.
Standardize. Every research group has their own tactile flavour. No ROS 2 standard message exists yet for tactile that everyone agrees on.

The honest take: tactile is going to matter in 2027–2028 production. In 2026, putting GelSight DIGITs on a research humanoid is a real capability if you have the research budget for it. Ordering one for a production deployment is premature.

Microphones, foot contact, and the rest

A few more sensors that round out the stack.

Microphone arrays. A 4-mic array is the standard, sometimes 6 or 8 on premium platforms. They do three jobs: wake-word detection (often local, low-power DSP), direction-of-arrival estimation for "who said that", and beamforming for clean speech-to-text in a noisy room. The DOA is the underrated feature — a robot that turns its head toward whoever spoke to it feels dramatically more present than one that does not. Single microphones are fine for desktop assistants and useless on a robot at 1.5 m height in a room with three people.

Foot contact / pressure. Two implementations. Load cells under the foot are the explicit version, used on Spot and ANYmal D. IMU-and-torque-derived is the implicit version, used on every Unitree and DeepRobotics platform: when the foot lands, the leg torque jumps and the IMU sees the impact, and the gait state machine infers contact from that. Load cells are more reliable on slippery surfaces where torque alone is ambiguous; the derived version is cheaper and good enough for 90% of the work.
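The implicit version can be sketched as a threshold with hysteresis on the estimated leg torque. The two thresholds are hypothetical, and a real gait state machine would also use the IMU impact signal and the expected gait phase; this shows only the torque half.

```python
def detect_contact(torques, on_threshold=8.0, off_threshold=4.0):
    """Infer foot contact from leg torque using two thresholds (hysteresis)."""
    in_contact = False
    states = []
    for tau in torques:
        if not in_contact and tau > on_threshold:
            in_contact = True    # torque jumped: the foot has landed
        elif in_contact and tau < off_threshold:
            in_contact = False   # torque fell away: the foot has lifted
        states.append(in_contact)
    return states


# Hypothetical torque trace in N*m over one step.
print(detect_contact([1.0, 9.0, 6.0, 5.0, 3.0, 2.0]))
```

The gap between the two thresholds stops the state from chattering when the torque hovers near a single cut-off, which is exactly the ambiguous case on slippery ground.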

Wheel encoders (for quadruped wheel-foot hybrids). Unitree B2-W and Go2-W add a powered wheel at the end of each leg. Each wheel has its own incremental encoder, and the SLAM pipeline fuses wheel odometry with IMU + LiDAR. Wheel odometry is the cheapest dead-reckoning input there is — when it is available, the SLAM solution is more stable. The trade-off is one more failure mode (wheels slip, wheels get gunked, wheels fail), which is why the wheel-foot variants need more careful state-estimation tuning than their walking siblings.

Sensor synchronization — the invisible failure mode

The thing that breaks more robot perception pipelines than anything else, and the thing almost no one talks about until they hit it, is time synchronization across sensors.

The problem: a depth camera produces frames at 30 Hz, the IMU runs at 1 kHz, the LiDAR at 10 Hz, and the joint encoders at 1 kHz. Each sensor has its own clock and its own latency from event to timestamp. If those clocks drift, or if the timestamps are wrong, sensor fusion produces garbage and no one knows why — the robot starts hallucinating obstacles, or its SLAM map smears every time it turns.

Three levels of fix, in order of capability:

Software NTP. Cheap, gets to 1–10 ms accuracy on a wired LAN. Fine for logging and high-level supervision. Catastrophic for tight sensor fusion at 1 kHz.
IEEE 1588 PTP (Precision Time Protocol). Hardware-assisted timestamping in the NIC, sub-microsecond accuracy on a switched Ethernet network. This is what serious multi-sensor robots use, and it requires PTP-capable switches and sensors. Livox LiDARs support PTP. RealSense supports it via firmware. Most cheap USB depth cameras do not.
Hardware trigger. A physical pulse line that fires the camera, the LiDAR, and the IMU sample-and-hold simultaneously. Nanosecond synchronization, but only works for sensors with an external trigger input — typically the industrial-grade cameras and LiDARs, not the consumer USB units.
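For the PTP level, the core arithmetic is the delay request-response exchange: four timestamps give the clock offset and the path delay, assuming the network path is symmetric. The timestamps below are invented nanosecond values; in a real deployment they come from the NIC hardware, not from application code.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Compute slave clock offset and one-way delay from a PTP exchange.

    t1: master sends Sync            (master clock)
    t2: slave receives Sync          (slave clock)
    t3: slave sends Delay_Req        (slave clock)
    t4: master receives Delay_Req    (master clock)
    Assumes the path delay is the same in both directions.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


# Invented example: slave clock runs 100 ns ahead, one-way delay is 500 ns.
offset_ns, delay_ns = ptp_offset_and_delay(t1=1000, t2=1600, t3=2000, t4=2400)
print(f"offset: {offset_ns:.0f} ns, path delay: {delay_ns:.0f} ns")
```

The symmetry assumption is why PTP-capable switches matter: queuing inside an ordinary switch makes the two directions unequal and the error lands directly in the offset estimate.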

What breaks without sync, in order of how often it bites:

VIO drift accelerates when IMU and camera timestamps disagree by more than 2–5 ms.
LIO produces ghost obstacles when the LiDAR and IMU disagree on where the robot was during the scan.
VLM-based grasping picks the wrong point when the depth frame and the joint-encoder reading used to position the gripper come from different moments in time.
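Even with a shared time base, sensors running at different rates never sample at the same instant, so fusion code has to resample one stream at the other's timestamps. A minimal sketch, assuming sorted integer microsecond timestamps and a scalar signal (for example a single joint angle); the function name is hypothetical.

```python
from bisect import bisect_left


def interpolate_at(timestamps_us, values, t_us):
    """Linearly interpolate a scalar signal at t_us; None if t_us is out of range."""
    if not timestamps_us or t_us < timestamps_us[0] or t_us > timestamps_us[-1]:
        return None
    i = bisect_left(timestamps_us, t_us)
    if timestamps_us[i] == t_us:
        return values[i]                       # exact sample available
    t0, t1 = timestamps_us[i - 1], timestamps_us[i]
    w = (t_us - t0) / (t1 - t0)                # fraction of the way from t0 to t1
    return values[i - 1] + w * (values[i] - values[i - 1])


# 1 kHz joint-encoder samples, and a depth frame stamped between two of them.
encoder_t = [0, 1000, 2000, 3000]
encoder_deg = [10.0, 11.0, 12.0, 13.0]
print(interpolate_at(encoder_t, encoder_deg, 1500))
```

Returning `None` outside the buffered range is deliberate: extrapolating a joint angle to a frame that arrived late hides exactly the timing fault this section describes.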

If you are integrating sensors from more than one vendor, ask the PTP question early. The cost of fixing it after the fact is usually a re-architecting of the sensor bus, not a tweak.

Compute load per sensor — what the Jetson is actually doing

A modern Jetson AGX Orin (200 TOPS, 32 GB RAM) on a humanoid is doing more than the spec sheet suggests. A realistic load distribution under sustained operation:

Workload	Rate	Approx GPU load (AGX Orin)
Depth camera RGB-D ingest + filter	30 Hz	5–10%
LiDAR point-cloud preprocessing	10 Hz	5–10%
Visual-inertial odometry (VIO)	30–60 Hz	10–20%
LiDAR-inertial odometry (LIO) + local map	10 Hz	15–25%
YOLOv8/v11-s on head camera	30 Hz	15–20%
Whisper-small STT (on demand)	bursty	10–15% during
Wake-word DSP	continuous	1–3%
Small VLM (Qwen2.5-VL 3B Q4)	on demand	50–80% during
ROS 2 middleware overhead	continuous	3–5%

Add it up and the Jetson is at 60–80% sustained, with peaks that push it into throttling when the VLM runs. There is essentially no headroom for a second large model, which is exactly the architecture argument from I01: the heavy thinking goes off-board because the Jetson is already full.
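The "add it up" step can be reproduced from the table. The sketch below naively sums the per-workload ranges, which assumes loads add linearly on a shared GPU; that is a simplification, and the table figures are themselves approximate.

```python
# (low %, high %) GPU load per workload, copied from the table above.
continuous = {
    "RGB-D ingest + filter": (5, 10),
    "LiDAR preprocessing": (5, 10),
    "VIO": (10, 20),
    "LIO + local map": (15, 25),
    "YOLO on head camera": (15, 20),
    "Wake-word DSP": (1, 3),
    "ROS 2 middleware": (3, 5),
}
on_demand = {"Whisper-small STT": (10, 15), "Small VLM": (50, 80)}


def total(loads):
    """Sum the low and high ends of every load range."""
    low = sum(lo for lo, _ in loads.values())
    high = sum(hi for _, hi in loads.values())
    return low, high


base_low, base_high = total(continuous)
vlm_low, vlm_high = on_demand["Small VLM"]
print(f"continuous workloads: {base_low}-{base_high}%")
print(f"with the VLM running: {base_low + vlm_low}-{base_high + vlm_high}%")
```

The continuous range brackets the 60–80% sustained figure, and the VLM pushes the naive sum past 100%, which is why the other workloads get squeezed or the module throttles while it runs.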

Tactile sensing, when added, takes another 10–30% per sensor. F/T sensing is essentially free (CPU-side, low rate). LiDAR is the surprise: point-cloud preprocessing is not cheap, and a high-density LiDAR (Hesai QT128, Ouster OS1 256-channel) will saturate a Jetson Orin Nano before the perception code even runs. Match LiDAR density to compute, not to marketing.

What to do next

The honest decision matrix for a 2026 legged-robot build:

IMU. Non-negotiable. BMI088 or ICM-42688 for indoor, VectorNav VN-100 or better if outdoor heading matters. Verify the bias-stability spec, not the sample rate.
Joint encoders. Non-negotiable. Dual encoders (absolute + incremental) on every actuated joint. Magnetic AS5048-class for legs and arms; optical only on precision wrists / fingers.
Depth camera. RealSense D435i for development, D455 for production indoor, D457 for ruggedised outdoor. Orbbec Gemini equivalents for budget-constrained or supply-risk-sensitive builds. Get global shutter on the depth path.
RGB camera. Add a separate global-shutter RGB if the depth camera's colour path is rolling. Skip if you have verified the shutter type.
LiDAR. Livox Mid-360 if you map indoor / mixed environments. Hesai XT16 / XT32 if you need spinning 360°. Ouster OS0 / OS1 for autonomy-software stacks (Spot / ANYmal payloads). Skip entirely on indoor manipulation-only humanoids.
F/T sensor. ATI Mini43LP-class at the wrist if your task involves contact-rich manipulation. Skip if pick-and-place from known poses is sufficient.
Tactile. GelSight DIGIT or equivalent if you are doing research on fine manipulation in 2026–2027 and have the budget. Production deployments — wait.
Mic array. 4-mic minimum if voice interaction matters. Single mic if voice is just wake-word.
Foot contact. Inherited from the platform. Load cells (Spot, ANYmal) for slippery industrial; IMU-derived (Unitree, DeepRobotics) for everything else.
Synchronization. Verify PTP support across all sensors before you commit to the bus architecture. This is the single most underrated decision in a multi-sensor build.

The marketing tendency is to add sensors. The engineering reality is that every additional sensor costs power, compute, integration time, and a calibration ritual. Add only what the task requires.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
