How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)?
We study this question by integrating a generic perceptual skill set (e.g., a distance estimator or an edge detector) within a reinforcement learning framework (see Figure 1).
This skill set (hereafter mid-level perception) provides the policy with a more processed state of the world than raw images do.
We find that using mid-level perception confers significant advantages over training end-to-end from scratch (i.e., not leveraging such priors) in navigation-oriented tasks: agents generalize to situations where the from-scratch approach fails, and training becomes significantly more sample-efficient. However, we show that realizing these gains requires careful selection of the mid-level perceptual skills. We therefore distill our findings into an efficient max-coverage feature set that can be adopted in lieu of raw images. We perform our study with completely separate buildings for training and testing, and compare against visually blind baseline policies and state-of-the-art feature-learning methods.
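To give a feel for the max-coverage idea, the sketch below greedily selects a small set of features that together "cover" the most downstream situations. This is purely illustrative and not the paper's exact selection procedure; the coverage data and feature/task names are hypothetical.

```python
from typing import Dict, List, Set

def greedy_max_coverage(covers: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedily pick up to k features that together cover the most tasks.

    `covers` maps each candidate feature to the set of downstream tasks
    it is assumed to handle well (hypothetical data, for illustration).
    """
    selected: List[str] = []
    covered: Set[str] = set()
    for _ in range(k):
        # Pick the feature adding the most not-yet-covered tasks.
        best = max(covers, key=lambda f: len(covers[f] - covered))
        if not covers[best] - covered:
            break  # no remaining feature adds any coverage
        selected.append(best)
        covered |= covers[best]
    return selected

# Hypothetical coverage data: which features help which tasks.
covers = {
    "surface_normals": {"navigation", "exploration"},
    "semantic_segmentation": {"navigation", "object_search"},
    "depth": {"navigation", "exploration", "obstacle_avoidance"},
}
print(greedy_max_coverage(covers, k=2))
# -> ['depth', 'semantic_segmentation']
```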
Figure 1: Mid-level perception module in an end-to-end framework for learning active robotic tasks. We systematically study whether and how a set of generic mid-level vision features can help with learning arbitrary downstream active tasks, and report significant advantages in sample efficiency and generalization. [The video of the Husky robot is by the Gibson Environment.]
The figure below illustrates our approach. Left: features warp the input distribution, potentially making the train and test distributions look more similar to the agent. Middle: the learned features from fixed encoder networks are used as the state for training policies with RL. Right: downstream tasks prefer features that contain enough information to solve the task while remaining invariant to changes in the input that are irrelevant for solving it.
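As a concrete sketch of the middle panel, the PyTorch-style module below freezes a pretrained mid-level encoder and trains only a small policy head on its features. The encoder interface, feature dimension, and head architecture are illustrative assumptions, not the exact networks used in the paper.

```python
import torch
import torch.nn as nn

class MidLevelPolicy(nn.Module):
    """Frozen mid-level encoder + trainable policy head (illustrative sketch)."""

    def __init__(self, encoder: nn.Module, feature_dim: int, num_actions: int):
        super().__init__()
        self.encoder = encoder
        self.encoder.eval()  # keep frozen batch-norm/dropout statistics
        # Freeze the encoder: only the policy head is updated by RL.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.policy_head = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # The agent's "state" is the mid-level feature, not the raw pixels.
        with torch.no_grad():
            features = self.encoder(rgb).flatten(start_dim=1)
        return self.policy_head(features)  # action logits
```

One such module per mid-level feature (depth, surface normals, etc.) can be swapped in without changing the underlying RL algorithm, which is what makes a systematic feature-by-feature comparison possible.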
To evaluate the quality of our transfers, we test agents in a new environment (a building never seen during training). Compare how well different types of vision allow agents to generalize, and whether training performance is indicative of test performance. You can view these agents side-by-side against agents that learn from scratch, and against other informative controls such as a blind agent.
The provided policy explorer gives a qualitative overview of what each agent sees and how it behaves. Compare agents that have access to different mid-level features: see sample trajectories, egocentric videos of those trajectories, and readouts of what the agent sees.
Example trajectories. Visualizations of sample trajectories from agents using different features.
We are currently working on demonstrating these results on a set of terrestrial robots, e.g. the Minitaur (shown in the video).
Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies.
Sax, Emi, Zamir, Guibas, Savarese, Malik.