Summary
- This paper studies whether visual representations pre-trained on diverse human video data (Ego4D) can enable data-efficient robot learning across multiple downstream manipulation tasks.
- They propose R3M, which combines time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations (a sketch of the combined objective follows this list).
- It improves task success by over 20% compared to training from scratch, and by over 10% compared to CLIP & MoCo.
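To make the training recipe concrete, here is a minimal PyTorch sketch of the combined objective. It assumes an image encoder that has already mapped sampled frames to embeddings, a language embedding for each video's annotation, and a learned alignment score network; the names, loss weights, and the exact form of the language-alignment loss are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of an R3M-style objective: time-contrastive learning,
# video-language alignment, and an L1 sparsity penalty on the embedding.
# `score_net` (scores a (start, current, language) triple) and the loss
# weights are assumed/illustrative components.
import torch
import torch.nn.functional as F


def time_contrastive_loss(z_anchor, z_pos, z_far, z_other, temp=0.1):
    # Frames close in time (anchor, pos) should be nearer in embedding space
    # than temporally distant frames (far) or frames from another video (other).
    sim = lambda a, b: -torch.norm(a - b, dim=-1) / temp
    logits = torch.stack([sim(z_anchor, z_pos),
                          sim(z_anchor, z_far),
                          sim(z_anchor, z_other)], dim=-1)
    labels = torch.zeros(z_anchor.shape[0], dtype=torch.long, device=z_anchor.device)
    return F.cross_entropy(logits, labels)  # the positive pair sits at index 0


def language_alignment_loss(score_net, z_start, z_mid, z_end, lang_emb):
    # Later frames of a video should score higher against its language
    # annotation than earlier frames (the video is "making progress").
    s_start = score_net(z_start, z_start, lang_emb)
    s_mid = score_net(z_start, z_mid, lang_emb)
    s_end = score_net(z_start, z_end, lang_emb)
    ones = torch.ones_like(s_end)
    return (F.binary_cross_entropy_with_logits(s_end - s_mid, ones)
            + F.binary_cross_entropy_with_logits(s_mid - s_start, ones))


def r3m_objective(z, score_net, lang_emb, w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    # z is a dict of frame embeddings sampled from the same batch of videos.
    l_tcn = time_contrastive_loss(z["anchor"], z["pos"], z["far"], z["other"])
    l_lang = language_alignment_loss(score_net, z["start"], z["mid"], z["end"], lang_emb)
    l_sparse = z["anchor"].abs().mean()  # L1 penalty -> sparse, compact features
    return w_tcn * l_tcn + w_lang * l_lang + w_l1 * l_sparse
```

The time-contrastive term captures temporal dynamics, the language term injects semantic priors, and the L1 term is what pushes the representation toward being sparse and compact.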
Motivation
- A common method to train a robot to complete a manipulation task from images is to train an end-to-end model from scratch using data from the same domain.
- CV & NLP have focused on using large, diverse datasets to build reusable, pre-trained representations. In robotics, by contrast, we don’t have a pre-trained model that can be downloaded and used for any downstream manipulation task.
- Why have we struggled to build this universal representation for robotics?
- Collecting large and diverse datasets of real-world robotic tasks can be costly.
- Although recent efforts have created multiple datasets (RoboNet, RoboTurk), these consist of a limited number of tasks in at most a handful of different environments.
- This lack of diversity makes it difficult to learn visual representations.
- We observe that representations in CV & NLP didn’t arise from task-specific, carefully curated datasets, but rather from abundant in-the-wild data. For robotics, we already have access to videos of humans interacting with their environments in semantically interesting ways.
- Can visual representations pre-trained on diverse human videos enable efficient downstream learning of robotic manipulation skills?
- A good representation should capture the temporal dynamics of the scene and useful semantic priors, while excluding irrelevant features.
- They propose Reusable Representation for Robotic Manipulation (R3M), which can be used as a frozen perception module for downstream policy learning (see the usage sketch below).
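For downstream use, the idea is that the pre-trained encoder stays frozen and only a small policy head is trained on the robot's own demonstrations. Below is a minimal behavior-cloning sketch; the `load_r3m` call follows the public r3m package's README as I understand it, and the input convention, feature size, and 7-dimensional action head are assumptions for illustration.

```python
# Minimal sketch: frozen R3M features + a small trainable policy head,
# trained by behavior cloning on a handful of robot demonstrations.
import torch
import torch.nn as nn
import torch.nn.functional as F
from r3m import load_r3m  # assumes the facebookresearch/r3m package is installed

encoder = load_r3m("resnet50")      # frozen perception module
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

policy = nn.Sequential(             # only this small head is trained
    nn.Linear(2048, 256), nn.ReLU(),
    nn.Linear(256, 7),              # e.g. a 7-DoF arm action (illustrative)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)


def bc_step(images, actions):
    # images: RGB frames in [0, 255], shape (B, 3, 224, 224) -- the input
    # range the r3m README describes; actions: (B, 7) demonstration actions.
    with torch.no_grad():
        feats = encoder(images.float())        # (B, 2048) frozen features
    loss = F.mse_loss(policy(feats), actions)  # simple behavior-cloning loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the encoder is never fine-tuned, the policy head has very few parameters to learn, which is what makes the downstream learning data-efficient.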