A critical bottleneck limiting imitation learning in robotics is the lack of data. This problem is more severe in mobile manipulation, where collecting demonstrations is harder than in stationary manipulation due to the lack of available and easy-to-use teleoperation interfaces. In this work, we demonstrate TeleMoMa, a general and modular interface for whole-body teleoperation of mobile manipulators. TeleMoMa unifies multiple human interfaces including RGB and depth cameras, virtual reality controllers, keyboard, joysticks, etc., and any combination thereof. In its more accessible version, TeleMoMa works using vision alone (e.g., an RGB-D camera), lowering the entry bar for humans to provide mobile manipulation demonstrations. We demonstrate the versatility of TeleMoMa by teleoperating several existing mobile manipulators -- PAL Tiago++, Toyota HSR, and Fetch -- in simulation and the real world. We demonstrate the quality of the demonstrations collected with TeleMoMa by training imitation learning policies for mobile manipulation tasks involving synchronized whole-body motion. Finally, we show that TeleMoMa's teleoperation channel enables teleoperation either on site, looking at the robot, or remotely, sending commands and observations through a computer network. We also perform user studies to evaluate how easy it is for novice users to learn to collect demonstrations with different combinations of human interfaces enabled by our system. We hope TeleMoMa becomes a helpful tool for the community, enabling researchers to collect whole-body mobile manipulation demonstrations.
TeleMoMa consists of three components: the Human Interface acquires commands from the human using different input devices; the Teleoperation Channel defines the action command structure between the human and the robot interfaces and closes the loop with observations from the robot; and the Robot Interface implements a robot-specific mapping of actions to low-level robot commands. This architecture enables modularity and versatility -- combining multiple devices to achieve intuitive whole-body teleoperation for multiple tasks and robots.
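To make the three-component split concrete, the following is a minimal Python sketch of how such a loop could be wired together. It is an illustration only, not the TeleMoMa API: the names `TeleopAction`, `KeyboardInterface`, `FakeRobot`, and `TeleopChannel`, the action fields, and the key-to-velocity mapping are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Protocol


@dataclass
class TeleopAction:
    """Hypothetical action command structure passed through the channel."""
    base: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])  # vx, vy, yaw rate
    torso: float = 0.0
    right_arm: List[float] = field(default_factory=lambda: [0.0] * 6)   # delta end-effector pose
    right_gripper: float = 0.0


class HumanInterface(Protocol):
    """Anything that can turn human input into an action."""
    def get_action(self) -> TeleopAction: ...


class RobotInterface(Protocol):
    """Anything that can execute an action and report an observation."""
    def step(self, action: TeleopAction) -> Dict[str, float]: ...


class KeyboardInterface:
    """Toy human interface: maps a stream of key presses to base velocities."""
    KEY_TO_BASE = {"w": [0.2, 0.0, 0.0], "s": [-0.2, 0.0, 0.0], "a": [0.0, 0.0, 0.5]}

    def __init__(self, keys: Iterable[str]):
        self._keys = iter(keys)

    def get_action(self) -> TeleopAction:
        key = next(self._keys, None)  # None once the key stream is exhausted
        return TeleopAction(base=list(self.KEY_TO_BASE.get(key, [0.0, 0.0, 0.0])))


class FakeRobot:
    """Toy robot interface: integrates the base command and returns its state."""

    def __init__(self, dt: float = 0.1):
        self.dt = dt
        self.x = 0.0
        self.yaw = 0.0

    def step(self, action: TeleopAction) -> Dict[str, float]:
        # Robot-specific mapping from the generic action to "low-level" state updates.
        self.x += action.base[0] * self.dt
        self.yaw += action.base[2] * self.dt
        return {"x": self.x, "yaw": self.yaw}


class TeleopChannel:
    """Carries actions from human to robot and observations back."""

    def __init__(self, human: HumanInterface, robot: RobotInterface):
        self.human = human
        self.robot = robot
        self.last_obs: Dict[str, float] = {}

    def tick(self) -> Dict[str, float]:
        action = self.human.get_action()
        self.last_obs = self.robot.step(action)  # closes the loop with an observation
        return self.last_obs


if __name__ == "__main__":
    channel = TeleopChannel(KeyboardInterface(["w", "w", "a"]), FakeRobot(dt=0.1))
    for _ in range(3):
        print(channel.tick())
```

Because the channel only depends on the two small protocols, swapping the keyboard for another input device or the fake robot for a real one would not require changing the loop itself, which is the kind of modularity the architecture above describes.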
TeleMoMa is modular: the Human Interface can combine different input devices, such as RGB and depth cameras, virtual reality controllers, mobile phones, a SpaceMouse, etc. For instance, while vision-based teleoperation feels more natural, hand tracking is imprecise due to the limitations of the human tracking model. Virtual reality controllers, on the other hand, provide more precise control but are less intuitive. TeleMoMa lets the user combine these interfaces, getting the best of both worlds: teleoperation that is both intuitive and precise.
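One way to picture this kind of combination is a per-body-part assignment of input devices, for example vision for the base and torso and a VR controller for the arm. The sketch below is a hypothetical illustration under that assumption; `BODY_PARTS`, `merge_actions`, and the device names are invented for this example and are not taken from the TeleMoMa codebase.

```python
from typing import Dict, List

# Hypothetical set of controllable body parts for a mobile manipulator.
BODY_PARTS = ("base", "torso", "left_arm", "right_arm")

# A partial action maps some body parts to command vectors.
PartialAction = Dict[str, List[float]]


def merge_actions(
    sources: Dict[str, PartialAction],
    assignment: Dict[str, str],
) -> PartialAction:
    """Build one whole-body action by taking each body part from its assigned device.

    sources:    device name -> partial action produced by that device
    assignment: body part   -> name of the device that should control it
    Body parts with no assigned device, or whose device sent nothing, are omitted.
    """
    merged: PartialAction = {}
    for part in BODY_PARTS:
        device = assignment.get(part)
        if device is None:
            continue
        command = sources.get(device, {}).get(part)
        if command is not None:
            merged[part] = command
    return merged


if __name__ == "__main__":
    sources = {
        # Vision tracks the whole body, but its arm estimate is the less precise one.
        "vision": {"base": [0.1, 0.0, 0.0], "torso": [0.02], "right_arm": [0.5, 0.5, 0.5]},
        # The VR controller only reports the arm, with higher precision.
        "vr": {"right_arm": [0.31, 0.12, 0.48]},
    }
    assignment = {"base": "vision", "torso": "vision", "right_arm": "vr"}
    print(merge_actions(sources, assignment))
```

With this assignment, the base and torso follow the natural body motion captured by vision, while the arm command comes from the more precise controller; changing the `assignment` dictionary is enough to try a different device combination.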
@article{dass2024telemoma,
title={TeleMoMa: A Modular and Versatile Teleoperation System for Mobile Manipulation},
author={Dass, Shivin and Ai, Wensi and Jiang, Yuqian and Singh, Samik and Hu, Jiaheng and Zhang, Ruohan and Stone, Peter and Abbatematteo, Ben and Martín-Martín, Roberto},
journal={arXiv preprint arXiv:2403.07869},
year={2024}
}