Today I read a paper titled "Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots." This research was conducted jointly by Stanford University, Columbia University, and the Toyota Research Institute.
It introduces the Universal Manipulation Interface (UMI), a data collection and policy learning framework that transfers skills acquired directly from in-the-wild human demonstrations into deployable robot policies. UMI pairs a handheld gripper with a carefully designed interface to enable portable, low-cost, and information-rich data collection, including demonstrations of challenging bimanual and dynamic manipulation. To make the learned policies deployable, UMI adds a carefully designed policy interface with inference-latency matching and a relative-trajectory action representation. The resulting policies are hardware-agnostic and can be deployed across multiple robot platforms. Together, these features let the UMI framework unlock new manipulation capabilities: dynamic, bimanual, precise, and long-horizon behaviors that generalize zero-shot, obtained simply by changing the training data for each task. We demonstrate UMI's versatility and effectiveness through comprehensive real-world experiments, showing that policies trained on human demonstrations collected with UMI generalize zero-shot to new environments and objects.
For example, we have long wanted a robot that can wash dishes:
To successfully wash dishes, the robot must sequentially execute seven interdependent actions: turning on the faucet, grasping the plate, picking up the sponge, cleaning and wiping the plate until the ketchup is removed, placing the plate down, putting the sponge away, and turning off the faucet.
Hardware Design
UMI's data collection hardware is a handheld parallel gripper with a GoPro camera mounted on it. To collect observations usable at policy deployment time, UMI must capture enough visual context to infer actions and recover key information such as depth. To collect actions that lead to deployable policies, UMI must record precise robot actions under fast human motion, capture fine-grained gripper width, and automatically verify that each demonstration is feasible under the target robot's kinematic constraints.
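The kinematic feasibility check can be sketched as a simple filter over recorded end-effector poses. This is a minimal illustration, not the UMI pipeline's actual check: the spherical-workspace model, the 0.85 m reach, and the function name are all assumptions for the sake of the example; a real check would run inverse kinematics against the target robot's full constraints.

```python
import math

# Hypothetical sketch: reject demonstrations whose end-effector positions
# leave a simplified reachable workspace (a sphere around the robot base).
# The radius and pose format are illustrative assumptions, not UMI's values.
def demo_is_valid(ee_positions, base=(0.0, 0.0, 0.0), max_reach=0.85):
    """ee_positions: list of (x, y, z) end-effector positions in meters."""
    for pos in ee_positions:
        if math.dist(pos, base) > max_reach:
            return False  # at least one pose is unreachable for this arm
    return True

reachable = [(0.3, 0.1, 0.4), (0.5, -0.2, 0.3)]
too_far = [(0.3, 0.1, 0.4), (1.2, 0.0, 0.5)]
print(demo_is_valid(reachable))  # True
print(demo_is_valid(too_far))    # False
```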
With its unique wrist-mounted camera setup and camera-centric action representation, UMI operates fully calibration-free (even while the base moves) and remains robust to disturbances and drastic lighting changes.
UMI Policy Interface Design
We synchronize the different observation streams using their physically measured latencies.
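The idea can be sketched as follows: each modality's clock is shifted back by its measured capture latency, and the most recent sample at or before that shifted time is taken. The stream names and latency values below are illustrative assumptions, not measurements from the paper.

```python
# Minimal sketch of latency-aware observation synchronization, assuming
# each stream is a list of (timestamp, value) samples. The latencies
# here are assumed values for illustration only.
MEASURED_LATENCY = {"rgb": 0.13, "ee_pose": 0.02, "gripper": 0.04}

def latest_before(stream, t):
    """Return the most recent sample value at or before time t."""
    candidates = [(ts, v) for ts, v in stream if ts <= t]
    return max(candidates)[1] if candidates else None

def synchronized_obs(streams, t_now):
    obs = {}
    for name, stream in streams.items():
        # Shift each stream's clock by its measured capture latency so
        # all modalities describe the same physical instant.
        obs[name] = latest_before(stream, t_now - MEASURED_LATENCY[name])
    return obs
```

For example, at `t_now = 0.25` an RGB stream sampled at 0.0/0.1/0.2 s contributes its 0.1 s frame (0.25 − 0.13 = 0.12 s), while the low-latency pose stream contributes a much fresher sample.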
The UMI policy receives a sequence of synchronized observations (RGB images, relative end-effector (EE) poses, and gripper width) and outputs a sequence of desired relative end-effector poses and gripper widths as actions.
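The relative action representation can be illustrated with homogeneous transforms: instead of absolute poses, each action is the transform from the current end-effector pose to a desired future pose, which is what makes the policy independent of any particular robot's base frame. This is a minimal sketch under that assumption; the helper names are not from the UMI codebase.

```python
import numpy as np

# Poses are 4x4 homogeneous transformation matrices in some world frame.
def relative_pose(t_current, t_target):
    """Action = T_current^{-1} @ T_target: the target pose expressed
    in the current end-effector frame (robot-base agnostic)."""
    return np.linalg.inv(t_current) @ t_target

def apply_action(t_current, rel_action):
    """At deployment, any robot recovers its own absolute target pose
    by composing its current EE pose with the relative action."""
    return t_current @ rel_action
```

Because the action lives in the end-effector's own frame, the same predicted trajectory can be replayed on different platforms, matching the hardware-agnostic deployment described above.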
We send action commands in advance to compensate for the robot's execution delay.
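A simple way to sketch this compensation: when the policy outputs a timed action sequence, the controller indexes into it at `t_now + delay` rather than `t_now`, so that what the robot finishes executing matches the intended timestep. The delay and control period below are assumed values for illustration.

```python
# Minimal sketch of execution-latency compensation, assuming a fixed,
# physically measured delay. Both constants are illustrative assumptions.
EXEC_DELAY = 0.1   # measured robot execution latency (s)
DT = 0.05          # control period between consecutive actions (s)

def latency_compensated_action(action_sequence, t_now, t_seq_start):
    """Select the action intended for t_now + EXEC_DELAY, not t_now,
    so the command lands on the robot at the right physical instant."""
    idx = round((t_now + EXEC_DELAY - t_seq_start) / DT)
    idx = max(0, min(idx, len(action_sequence) - 1))
    return action_sequence[idx]
```

With a 0.1 s delay and 0.05 s control period, the controller at the start of a sequence already sends the action two steps ahead.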
In-the-Wild Generalization Experiments
With UMI, you can go to any home or restaurant and start collecting data within two minutes.
Using a diverse in-the-wild cup manipulation dataset, UMI enables us to train a diffusion policy that can generalize to extreme out-of-distribution objects and environments, including serving espresso cups above a water fountain!
Narrow-Domain Evaluation Results
The initial states of all evaluation episodes are overlaid in a single image.
For each task, all methods start from the same set of initial states, which are manually matched via reference images.
Typical failure modes of baseline/ablation policies: red arrows indicate failure behaviors, green arrows indicate desired behaviors.
Success rates are reported over 20 evaluation episodes, with the best-performing entry in each column highlighted in bold.