Whole-body humanoid loco-manipulation

OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation

System-comparison rollout
12 language-conditioned benchmark tasks
89% average progress on the 8 whole-body training tasks
87.5% system-comparison fruit task progress with HuMI co-training
1.14 h demo time in the system comparison setting

Recipe

A controlled roadmap for whole-body humanoid VLAs

OpenHLM studies how to collect full-body demonstrations, how to adapt a robot-pretrained VLA to a 34-D humanoid action space, and how cheaper heterogeneous data extends the policy beyond full whole-body teleoperation. The policy treats the humanoid as one coordinated kinematic chain rather than a wheeled base with arms.

G1 Whole-body native

One policy commands arms, waist, knees, feet, and grippers.

G2 Language-steerable

A single checkpoint follows task prompts across many skills.

G3 Extensible through cheap data

HuMI and stationary data reduce the need for full-body teleop.

Benchmark first

HLM-12 benchmark

HLM-12 contains twelve language-conditioned real-robot tasks spanning pick-and-place with locomotion, whole-body workspace extension, body parts used as manipulators, and constrained contact-rich motion. The tasks are listed in alphabetical order.

01 / 12  Bottle Disposal (Body as manipulator)
02 / 12  Cart Pushing (Environmental constraint)
03 / 12  Cola Placement (Pick-and-place with locomotion)
04 / 12  Gum Can Placement (Pick-and-place with locomotion)
05 / 12  Jar Opening (Environmental constraint)
06 / 12  Pig Placement (Pick-and-place with locomotion)
07 / 12  Pouring (Environmental constraint)
08 / 12  Shelf Cube Transfer (Whole-body workspace extension)
09 / 12  Shelf Cup Transfer (Whole-body workspace extension)
10 / 12  Shuttlecock Setup (Environmental constraint)
11 / 12  Sword Extraction (Environmental constraint)
12 / 12  Toy Stowing (Body as manipulator)

Examples from the HLM-12 benchmark, spanning locomotion, manipulation, squatting, foot interaction, and constrained contact.
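As a small illustration of how the benchmark is composed, the sketch below records each task with the category shown in the task list and counts tasks per category. The names `HLM12_TASKS` and `category_counts` are hypothetical and chosen only for this example; the task names and categories are copied from the list above.

```python
from collections import Counter

# Task name -> category, copied from the HLM-12 task list.
HLM12_TASKS = {
    "Bottle Disposal": "Body as manipulator",
    "Cart Pushing": "Environmental constraint",
    "Cola Placement": "Pick-and-place with locomotion",
    "Gum Can Placement": "Pick-and-place with locomotion",
    "Jar Opening": "Environmental constraint",
    "Pig Placement": "Pick-and-place with locomotion",
    "Pouring": "Environmental constraint",
    "Shelf Cube Transfer": "Whole-body workspace extension",
    "Shelf Cup Transfer": "Whole-body workspace extension",
    "Shuttlecock Setup": "Environmental constraint",
    "Sword Extraction": "Environmental constraint",
    "Toy Stowing": "Body as manipulator",
}

# Count how many of the twelve tasks fall into each category.
category_counts = Counter(HLM12_TASKS.values())
for category, count in sorted(category_counts.items()):
    print(f"{category}: {count}")
```

Environmental-constraint tasks make up the largest group (five of twelve), followed by pick-and-place with locomotion (three).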

Context 01

Controller and teleoperation

The low-level controller tracks whole-body reference commands. The teleoperation interface therefore determines both what behaviors can be demonstrated and what action space the VLA later learns. OpenHLM adopts joint-based whole-body teleoperation: full-body motion capture retargeted online to robot joint space, tracked with 0.2 s future-frame preview.
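The future-frame preview can be pictured as a short delay buffer: the tracker is handed the reference frame from 0.2 s ago as its current target, together with the frames that arrived since then as its look-ahead. The sketch below is a minimal illustration under stated assumptions, not the OpenHLM implementation; the class name `PreviewBuffer`, the 50 Hz reference rate, and the two-joint toy frames are all hypothetical.

```python
from collections import deque


class PreviewBuffer:
    """Delay incoming reference frames so the tracker also sees future frames."""

    def __init__(self, preview_s=0.2, rate_hz=50.0):
        # rate_hz is a hypothetical reference rate; 0.2 s is the preview from the text.
        self.preview_steps = int(round(preview_s * rate_hz))
        self.frames = deque(maxlen=self.preview_steps + 1)

    def push(self, joint_targets):
        """Add the newest retargeted frame; return (current, future) once full."""
        self.frames.append(tuple(joint_targets))
        if len(self.frames) < self.frames.maxlen:
            return None  # still filling the preview window
        window = list(self.frames)
        return window[0], window[1:]


buffer = PreviewBuffer()
for step in range(12):
    result = buffer.push([0.01 * step, -0.01 * step])  # toy 2-joint frame
    if result is not None:
        current, future = result
        print(step, current, len(future))
```

The trade-off named in the text shows up directly in this structure: a longer preview gives the tracker more look-ahead for smooth locomotion, but the operator sees the robot respond later by the same amount.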

Teleop method comparison

The comparison covers three metrics: task progress, rollout time, and mean footsteps. Slashed cells mark tasks the method cannot perform by construction.

Method | Cola Placement | Shelf Cup Transfer | Bottle Disposal
Decoupled control | 66.7% | 93.3% | /
VR 3-point teleoperation | 40.0% on one task; the other two cells are slashed
Joint-based whole-body teleop. | 86.7% | 80.0% | 85.0%

21-D action

Decoupled control

Mobile-manipulation style control; the foot cannot be used as a manipulator.

24-D action

VR 3-point

Sparse head-and-hands control gives only indirect command over the lower body.

32-D action

Joint-based whole-body

Every joint is commanded, enabling squatting, full-body stepping, and foot interaction.

Joint-space retargeting: 88% progress vs. 75% for native SMPL recording.
Preview latency: 0.2 s balances teleop responsiveness and locomotion smoothness.

Context 02

VLA policy design

OpenHLM starts from a robot-pretrained VLA and adapts the interface to the humanoid: weight surgery for the enlarged action projection, pretrained bimanual action ordering, absolute joint targets, proprioceptive input, and multi-step flow matching. The dominant factor is not a small interface choice; it is robot pretraining.
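One plausible reading of weight surgery for the enlarged action projection is to keep the pretrained rows in the slots the pretrained model already used and to initialize only the added action dimensions from scratch. The sketch below illustrates that idea on toy sizes; it is an assumption about the general technique, not the authors' code. The function name `expand_action_projection`, the `init_scale` value, and the 2-to-5 dimension example are hypothetical.

```python
import random


def expand_action_projection(pretrained_rows, new_dim, init_scale=0.02, seed=0):
    """Grow an action projection to new_dim rows, keeping pretrained rows first.

    pretrained_rows: list of rows, one per pretrained action dimension.
    New rows are drawn from a small Gaussian (hypothetical init_scale).
    """
    if new_dim < len(pretrained_rows):
        raise ValueError("new_dim must be at least the pretrained action dimension")
    rng = random.Random(seed)
    hidden = len(pretrained_rows[0])
    # Copy pretrained rows so the original weights are left untouched.
    expanded = [list(row) for row in pretrained_rows]
    # Append freshly initialized rows for the added action dimensions.
    while len(expanded) < new_dim:
        expanded.append([rng.gauss(0.0, init_scale) for _ in range(hidden)])
    return expanded


pretrained = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 2-D action, 3-D hidden
expanded = expand_action_projection(pretrained, new_dim=5)
print(len(expanded), len(expanded[0]))
```

Placing the pretrained rows first only helps if the humanoid's arm and gripper dimensions occupy the same positions the pretrained model used, which is the role of the pretrained bimanual action ordering named above.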

OpenHLM default: 91.2%
Random action projection: 87.5%
Humanoid-native ordering: 88.3%
Relative actions: 84.2%
No proprioception: 86.7%
PaliGemma init: 59.6%
Random init: 41.7%
One-step flow: 70.8%
Drifting model: 69.2%

Interface ablations

Projection initialization, action ordering, relative targets, and proprioception each produce only modest drops when flipped alone: between 2.9 and 7.0 points below the 91.2% default.
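The size of each interface ablation's drop follows directly from the reported scores. The short sketch below recomputes them; the variable names are hypothetical and the numbers are the ones listed above.

```python
DEFAULT_SCORE = 91.2  # OpenHLM default, in percent

# Reported scores for the four interface ablations, in percent.
interface_ablations = {
    "Random action projection": 87.5,
    "Humanoid-native ordering": 88.3,
    "Relative actions": 84.2,
    "No proprioception": 86.7,
}

# Drop relative to the default, rounded to one decimal like the source figures.
drops = {name: round(DEFAULT_SCORE - score, 1)
         for name, score in interface_ablations.items()}

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: -{drop} points")
```

The largest of the four drops comes from switching to relative actions and the smallest from humanoid-native ordering.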

Pretraining

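Robot pretraining transfers across the embodiment gap; without it, performance collapses on the robot, to 59.6% with PaliGemma initialization and 41.7% with random initialization, against 91.2% for the default.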

Inference

The one-step alternatives (one-step flow at 70.8% and the drifting model at 69.2%) reduce latency but underperform the multi-step default at 91.2%, showing that closed-loop smoothness matters more than held-out action MSE.
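For readers unfamiliar with the distinction, multi-step flow-matching inference integrates a velocity field from noise to an action over several small steps, whereas a one-step sampler takes a single step. The sketch below shows only that integration loop with a forward-Euler scheme. The velocity function `toy_velocity` is a hypothetical, hand-written stand-in for a learned network, chosen to be time-dependent so that the step count visibly changes the result; it says nothing about how the actual policy behaves.

```python
def toy_velocity(action, t, target):
    # Hypothetical stand-in for a learned velocity network v(a, t | observation).
    # It ignores the current action and depends only on time and a fixed target.
    return [2.0 * t * g for g in target]


def sample_action(noise, target, num_steps):
    """Forward-Euler integration of the velocity field from t=0 to t=1."""
    action = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        velocity = toy_velocity(action, t, target)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


noise = [0.0, 0.0]
target = [1.0, -0.5]
print("1 step:  ", sample_action(noise, target, num_steps=1))
print("10 steps:", sample_action(noise, target, num_steps=10))
```

With this toy field, a single Euler step evaluates the velocity only at t=0 and does not move at all, while ten steps land at 90% of the target; more steps reduce the discretization error at the cost of more network evaluations per action.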

Context 03

Heterogeneous co-training

Full whole-body teleoperation is expensive to scale. OpenHLM mixes the 8-task whole-body dataset with cheaper demonstrations for held-out tasks: stationary same-embodiment teleop and HuMI robot-free data.
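A common way to mix datasets of this kind is to draw each training example from one of the sources with fixed probabilities. The sketch below illustrates that pattern only; the source does not state the mixing ratio, so `cheap_fraction`, the source labels, and the function `sample_sources` are all hypothetical.

```python
import random

WHOLE_BODY = "whole_body_teleop_8_tasks"
CHEAP_SOURCES = ("stationary_teleop", "humi_robot_free")


def sample_sources(batch_size, cheap_source, cheap_fraction=0.3, seed=0):
    """Pick a data source for each example in a batch.

    cheap_fraction is a hypothetical mixing ratio, not a value from the source.
    """
    if cheap_source not in CHEAP_SOURCES:
        raise ValueError(f"unknown cheap source: {cheap_source}")
    rng = random.Random(seed)
    names = [WHOLE_BODY, cheap_source]
    weights = [1.0 - cheap_fraction, cheap_fraction]
    return rng.choices(names, weights=weights, k=batch_size)


batch = sample_sources(batch_size=8, cheap_source="humi_robot_free")
print(batch)
```

Keeping the whole-body data in every batch is what lets the cheaper demonstrations add held-out tasks without the policy losing the coordinated whole-body behavior.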

36% held-out progress

8-task baseline

held-out tasks only

65% held-out progress

HuMI co-training

new objects and prompts

76% held-out progress

Stationary co-training

new semantics and motions

94% held-out progress

12-task teleop oracle

full whole-body teleop

Stationary teleoperation transfers both new semantics and new motions, lifting held-out progress from 36% to 76%. HuMI is cheaper and effective for new objects and instructions, but at this scale it is weaker for novel motion patterns such as pouring.
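One way to read these four numbers is as the share of the gap between the 8-task baseline and the 12-task teleop oracle that each cheap data source recovers. The sketch below does that arithmetic; the helper name `gap_recovered` is hypothetical and the percentages are the ones reported above.

```python
BASELINE = 36.0  # 8-task baseline, held-out progress in percent
ORACLE = 94.0    # 12-task teleop oracle, held-out progress in percent

cotraining = {
    "HuMI co-training": 65.0,
    "Stationary co-training": 76.0,
}


def gap_recovered(score, baseline=BASELINE, oracle=ORACLE):
    """Fraction of the baseline-to-oracle gap closed by a given score."""
    return (score - baseline) / (oracle - baseline)


for name, score in cotraining.items():
    print(f"{name}: {gap_recovered(score):.0%} of the gap to the oracle")
```

By this measure HuMI co-training closes half of the 58-point gap and stationary co-training closes roughly two thirds of it.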

Context 04

System comparison

The final comparison uses a multi-site fruit task spanning a low coffee table, a medium table, and a tall shelf. The language instruction specifies an ordered pair of fruits; the robot must pick one with each hand and place them into separate shelf containers.

OpenHLM (HuMI co-training)

Whole-body teleop on 6 fruit pairs plus HuMI demonstrations for held-out pairs.

87.5% progress, 1.14 h demo time

GR00T N1.6

Official checkpoint with the recommended fine-tuning protocol.

57.5% progress, 2.70 h demo time

Psi0

Humanoid-pretrained baseline using decoupled whole-body control.

48.8% progress, 2.70 h demo time

Context 05

In-the-wild

OpenHLM also runs outside the lab. In this outdoor spirit-disposal rollout, the humanoid is controlled by an autonomous VLA policy rather than by a teleoperator.

Outdoor autonomous rollout: To our knowledge, this is the first outdoor VLA robot policy demonstration on a humanoid fully controlled by an autonomous policy.

The policy must handle changed lighting, ground texture, background clutter, and scene geometry while executing the whole-body loco-manipulation sequence.