Summary
- This paper studies whether visual representations pre-trained on diverse human video data (Ego4D) can enable data-efficient robot learning across multiple downstream manipulation tasks.
- They propose R3M, which combines time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations (a sketch of the combined objective follows this list).
- It improves task success by over 20% compared to training from scratch, and by over 10% compared to CLIP & MoCo.
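To make the training recipe concrete, here is a minimal PyTorch sketch of the combined objective. It assumes an image encoder that has already mapped sampled frames to embeddings, a language embedding for each video's annotation, and a learned alignment score network; the names, loss weights, and the exact form of the language-alignment loss are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of an R3M-style objective: time-contrastive learning,
# video-language alignment, and an L1 sparsity penalty on the embedding.
# `score_net` (scores a (start, current, language) triple) and the loss
# weights are assumed/illustrative components.
import torch
import torch.nn.functional as F


def time_contrastive_loss(z_anchor, z_pos, z_far, z_other, temp=0.1):
    # Frames close in time (anchor, pos) should be nearer in embedding space
    # than temporally distant frames (far) or frames from another video (other).
    sim = lambda a, b: -torch.norm(a - b, dim=-1) / temp
    logits = torch.stack([sim(z_anchor, z_pos),
                          sim(z_anchor, z_far),
                          sim(z_anchor, z_other)], dim=-1)
    labels = torch.zeros(z_anchor.shape[0], dtype=torch.long, device=z_anchor.device)
    return F.cross_entropy(logits, labels)  # the positive pair sits at index 0


def language_alignment_loss(score_net, z_start, z_mid, z_end, lang_emb):
    # Later frames of a video should score higher against its language
    # annotation than earlier frames (the video is "making progress").
    s_start = score_net(z_start, z_start, lang_emb)
    s_mid = score_net(z_start, z_mid, lang_emb)
    s_end = score_net(z_start, z_end, lang_emb)
    ones = torch.ones_like(s_end)
    return (F.binary_cross_entropy_with_logits(s_end - s_mid, ones)
            + F.binary_cross_entropy_with_logits(s_mid - s_start, ones))


def r3m_objective(z, score_net, lang_emb, w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    # z is a dict of frame embeddings sampled from the same batch of videos.
    l_tcn = time_contrastive_loss(z["anchor"], z["pos"], z["far"], z["other"])
    l_lang = language_alignment_loss(score_net, z["start"], z["mid"], z["end"], lang_emb)
    l_sparse = z["anchor"].abs().mean()  # L1 penalty -> sparse, compact features
    return w_tcn * l_tcn + w_lang * l_lang + w_l1 * l_sparse
```

The time-contrastive term captures temporal dynamics, the language term injects semantic priors, and the L1 term is what pushes the representation toward being sparse and compact.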
Motivation
- A common method to train a robot to complete a manipulation task from images is to train an end-to-end model from scratch using data from the same domain.
- CV & NLP have focused on using large, diverse datasets to build reusable, pre-trained representations. In robotics, by contrast, we don’t have a pre-trained model that can be downloaded and used for any downstream manipulation task.
- Why have we struggled to build this universal representation for robotics?
- Collecting large and diverse datasets of real-world robotic tasks can be costly.
- Although recent efforts have created multiple datasets (RoboNet, RoboTurk), these consist of a limited number of tasks in at most a handful of different environments.
- This lack of diversity makes it difficult to learn visual representations.
- We observe that representations in CV & NLP didn’t arise from task-specific, carefully curated datasets, but rather from abundant in-the-wild data. For robotics, we already have access to videos of humans interacting with their environments in semantically interesting ways.
- Can visual representations pre-trained on diverse human videos enable efficient downstream learning of robotic manipulation skills?
- A good representation should capture the temporal dynamics of the scene and useful semantic priors, while excluding irrelevant features.
- They propose Reusable Representation for Robotic Manipulation (R3M), which can be used as a frozen perception module for downstream policy learning (see the usage sketch below).
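For downstream use, the idea is that the pre-trained encoder stays frozen and only a small policy head is trained on the robot's own demonstrations. Below is a minimal behavior-cloning sketch; the `load_r3m` call follows the public r3m package's README as I understand it, and the input convention, feature size, and 7-dimensional action head are assumptions for illustration.

```python
# Minimal sketch: frozen R3M features + a small trainable policy head,
# trained by behavior cloning on a handful of robot demonstrations.
import torch
import torch.nn as nn
import torch.nn.functional as F
from r3m import load_r3m  # assumes the facebookresearch/r3m package is installed

encoder = load_r3m("resnet50")      # frozen perception module
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

policy = nn.Sequential(             # only this small head is trained
    nn.Linear(2048, 256), nn.ReLU(),
    nn.Linear(256, 7),              # e.g. a 7-DoF arm action (illustrative)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)


def bc_step(images, actions):
    # images: RGB frames in [0, 255], shape (B, 3, 224, 224) -- the input
    # range the r3m README describes; actions: (B, 7) demonstration actions.
    with torch.no_grad():
        feats = encoder(images.float())        # (B, 2048) frozen features
    loss = F.mse_loss(policy(feats), actions)  # simple behavior-cloning loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the encoder is never fine-tuned, the policy head has very few parameters to learn, which is what makes the downstream learning data-efficient.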