Recipe
A controlled roadmap for whole-body humanoid VLAs
OpenHLM studies how to collect full-body demonstrations, how to adapt a robot-pretrained VLA to a 34-D humanoid action space, and how cheaper heterogeneous data extends the policy beyond full whole-body teleoperation. The policy treats the humanoid as one coordinated kinematic chain rather than a wheeled base with arms.
One policy commands arms, waist, knees, feet, and grippers.
A single checkpoint follows task prompts across many skills.
HuMI and stationary data reduce the need for full-body teleop.
Benchmark first
HLM-12 benchmark
HLM-12 contains twelve language-conditioned real-robot tasks spanning pick-and-place with locomotion, whole-body workspace extension, body parts used as manipulators, and constrained contact-rich motion.
Examples from the HLM-12 benchmark, spanning locomotion, manipulation, squatting, foot interaction, and constrained contact.
Context 01
Controller and teleoperation
The low-level controller tracks whole-body reference commands. The teleoperation interface therefore determines both what behaviors can be demonstrated and what action space the VLA later learns. OpenHLM adopts joint-based whole-body teleoperation: full-body motion capture is retargeted online to robot joint space and tracked with a 0.2 s future-frame preview.
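One way to give a tracker a future-frame preview in an online setting is to hold the newest retargeted frames in a short buffer, so that each command is issued together with the frames that follow it. The sketch below illustrates that idea only; the `PreviewBuffer` class, the 50 Hz streaming rate, and the delay-based design are assumptions for illustration and are not taken from the OpenHLM controller.

```python
from collections import deque


class PreviewBuffer:
    """Delay a streamed reference so each command carries a short look-ahead."""

    def __init__(self, preview_s=0.2, rate_hz=50.0):
        # rate_hz is a hypothetical streaming rate; 0.2 s at 50 Hz is 10 frames.
        self.horizon = round(preview_s * rate_hz)
        self.frames = deque(maxlen=self.horizon + 1)

    def push(self, joint_frame):
        """Add the newest retargeted frame; return (current, preview) once full."""
        self.frames.append(list(joint_frame))
        if len(self.frames) <= self.horizon:
            return None  # still filling the look-ahead window
        window = list(self.frames)
        return window[0], window[1:]


# Stream twelve dummy 34-D frames through the buffer.
buffer = PreviewBuffer()
outputs = [buffer.push([float(i)] * 34) for i in range(12)]
current, preview = outputs[-1]
```

In this design the tracked frame lags the operator by the preview horizon, which is the price of seeing the upcoming frames.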
Teleop method comparison
The comparison reports task progress, rollout time, and mean footsteps for each method. Slashed cells mark tasks the method cannot perform by construction.
Decoupled control
Mobile-manipulation style control; the foot cannot be used as a manipulator.
VR 3-point
Sparse head-and-hands control gives only indirect command over the lower body.
Joint-based whole-body
Every joint is commanded, enabling squatting, full-body stepping, and foot interaction.
Context 02
VLA policy design
OpenHLM starts from a robot-pretrained VLA and adapts the interface to the humanoid: weight surgery for the enlarged action projection, pretrained bimanual action ordering, absolute joint targets, proprioceptive input, and multi-step flow matching. The dominant factor is not a small interface choice; it is robot pretraining.
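Weight surgery for an enlarged action projection can be pictured as copying the pretrained rows into a larger matrix and freshly initializing the rest. The following is a minimal sketch under stated assumptions: the function name `expand_action_projection`, the 14-D pretrained action size, the hidden width of 8, and the small Gaussian initialization are all hypothetical, and the pretrained rows are assumed to occupy the leading indices so that their original ordering is preserved.

```python
import random


def expand_action_projection(w_old, new_dim, init_scale=0.02, seed=0):
    """Return a (new_dim x hidden) matrix whose leading rows are copied from w_old."""
    old_dim = len(w_old)
    hidden = len(w_old[0])
    if new_dim < old_dim:
        raise ValueError("new_dim must be at least the pretrained action size")
    rng = random.Random(seed)
    # Copy pretrained rows so the reused action dimensions keep their weights.
    w_new = [list(row) for row in w_old]
    # Newly added action dimensions get small random rows (an assumed choice).
    for _ in range(new_dim - old_dim):
        w_new.append([rng.gauss(0.0, init_scale) for _ in range(hidden)])
    return w_new


# Hypothetical sizes: a 14-D pretrained head grown to the 34-D humanoid space.
w_pretrained = [[float(i * 8 + j) for j in range(8)] for i in range(14)]
w_humanoid = expand_action_projection(w_pretrained, new_dim=34)
```

Keeping the copied rows at the same indices is what lets the reused dimensions start from their pretrained behavior instead of from scratch.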
Flipping any one of projection initialization, action ordering, absolute-versus-relative targets, or proprioception on its own produces only a modest drop.
Robot pretraining transfers across the embodiment gap; PaliGemma and random initialization collapse on the robot.
One-step alternatives reduce latency but underperform, showing that closed-loop smoothness matters more than held-out action MSE.
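The difference between one-step and multi-step sampling comes down to how many times the velocity network is evaluated while integrating from noise to an action. The sketch below uses plain Euler integration with a toy velocity field; `sample_actions`, `toy_velocity`, the time convention (noise at t=0, actions at t=1), and the step count of 10 are assumptions for illustration, not the OpenHLM sampler.

```python
def sample_actions(velocity_fn, noise, num_steps=10):
    """Euler-integrate a velocity field from t=0 (noise) to t=1 (actions)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = 1.0 / num_steps
    x = list(noise)
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)  # one network call per step in a real policy
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    # Stand-in for the learned network: points straight at a fixed target.
    target = [0.5, -0.25]
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]


one_step = sample_actions(toy_velocity, [0.0, 0.0], num_steps=1)
multi_step = sample_actions(toy_velocity, [0.0, 0.0], num_steps=10)
```

In this toy field the path is a straight line, so both settings land on the same target. A learned field need not be straight, which is where the extra evaluations of multi-step sampling can change the result at the cost of latency.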
Context 03
Heterogeneous co-training
Full whole-body teleoperation is expensive to scale. OpenHLM mixes the 8-task whole-body dataset with cheaper demonstrations for held-out tasks: stationary same-embodiment teleop and HuMI robot-free data.
8-task baseline
held-out tasks only
HuMI co-training
new objects and prompts
Stationary co-training
new semantics and motions
12-task teleop oracle
full whole-body teleop
Stationary teleoperation transfers both new semantics and new motions, lifting held-out progress from 36% to 76%. HuMI is cheaper and effective for new objects and instructions, but at this scale it is weaker for novel motion patterns such as pouring.
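Mixing data sources during training is often implemented as weighted sampling over datasets. The sketch below shows that pattern only; the `sample_mixture` function, the dataset names, the placeholder examples, and the 0.6/0.2/0.2 weights are hypothetical and do not reflect the mixture ratios used by OpenHLM.

```python
import random


def sample_mixture(datasets, weights, num_samples, seed=0):
    """Draw (source, example) pairs; each source is chosen with the given weight."""
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(num_samples):
        # Pick a data source first, then a uniformly random example from it.
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append((source, rng.choice(datasets[source])))
    return batch


# Placeholder episode identifiers standing in for real demonstrations.
datasets = {
    "whole_body_teleop": ["wb_0", "wb_1", "wb_2"],
    "stationary_teleop": ["st_0", "st_1"],
    "humi": ["hu_0", "hu_1"],
}
weights = {"whole_body_teleop": 0.6, "stationary_teleop": 0.2, "humi": 0.2}
batch = sample_mixture(datasets, weights, num_samples=1000)
```

Sampling the source before the example keeps the mixture ratio independent of dataset size, so a small but valuable source is not drowned out by a larger one.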
Context 04
System comparison
The final comparison uses a multi-site fruit task spanning a low coffee table, a medium table, and a tall shelf. The language instruction specifies an ordered pair of fruits; the robot must pick one with each hand and place them into separate shelf containers.
OpenHLM (HuMI co-training)
Whole-body teleop on 6 fruit pairs plus HuMI demonstrations for held-out pairs.
GR00T N1.6
Official checkpoint with the recommended fine-tuning protocol.
Psi0
Humanoid-pretrained baseline using decoupled whole-body control.
Context 05
In-the-wild
OpenHLM also runs outside the lab. In this outdoor spirit-disposal rollout, the humanoid is controlled by an autonomous VLA policy rather than by a teleoperator.
The policy must handle changed lighting, ground texture, background clutter, and scene geometry while executing the whole-body loco-manipulation sequence.