How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)?
We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework--see Figure 1.
This skill set (hereafter mid-level perception) provides the policy with a more processed state of the world compared to raw images.
We find that using mid-level perception confers significant advantages over training end-to-end from scratch (i.e. not leveraging priors) in navigation-oriented tasks: agents are able to generalize to situations where the from-scratch approach fails, and training becomes significantly more sample efficient. However, we show that realizing these gains requires careful selection of the mid-level perceptual skills. This suggests that a set of features, rather than any single one, could provide better generic perception, and we computationally derive and experimentally validate an example of such a set: the max-coverage feature set, which can be adopted in lieu of raw images. We perform our study using completely separate buildings for training and testing, and compare against multiple controls and state-of-the-art feature learning methods.
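The max-coverage set itself is derived computationally. As a rough, hypothetical illustration only (the feature names, the coverage matrix, and the greedy strategy below are assumptions made for this sketch, not the exact procedure used in the paper), one way to pick a small set of features that together "cover" the remaining ones is a greedy set-cover style selection:

```python
import numpy as np

def greedy_max_coverage(covers: np.ndarray, feature_names: list, k: int) -> list:
    """Greedily choose k features so that as many features as possible are
    'covered' by (i.e. perceptually close to) at least one chosen feature.

    covers[i, j] is True if feature i covers feature j, e.g. if the two
    representations lie within some distance threshold of each other.
    """
    n = covers.shape[0]
    covered = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(k):
        # Pick the feature that newly covers the most uncovered features.
        gains = (covers & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        selected.append(feature_names[best])
        covered |= covers[best]
    return selected

# Hypothetical example with four candidate features:
names = ["depth", "edges", "surface_normals", "semantics"]
covers = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=bool)
print(greedy_max_coverage(covers, names, k=2))  # -> ['surface_normals', 'semantics']
```

This sketch ignores ties and re-selection; it is only meant to convey the flavor of selecting a compact feature set with maximal coverage.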
Click here for a brief video overview.
Mid-level perception in an end-to-end framework for learning visuomotor tasks. We systematically study if/how a set of generic mid-level vision features can help with learning arbitrary downstream visuomotor tasks. We report significant advantages in sample efficiency, generalization, and final performance.
[The above video of the Husky robot is in the Gibson Environment]
The figure below illustrates our approach: Left: Features warp the input distribution, potentially making the train and test distributions look more similar to the agent. Middle: The learned features from fixed encoder networks are used as the state for training policies in RL. Right: Downstream tasks prefer features that contain enough information to solve the task while remaining invariant to the task-irrelevant details.
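To make the middle panel concrete, below is a minimal sketch, assuming a Gym-style environment and a PyTorch-style pretrained mid-level network (the class and variable names are hypothetical), of how a fixed encoder can be wrapped around the environment so that the RL algorithm only ever sees features rather than raw pixels:

```python
import gym
import numpy as np
import torch
import torch.nn as nn

class MidLevelFeatureWrapper(gym.ObservationWrapper):
    """Replaces raw RGB observations with features from a fixed encoder,
    so a standard RL algorithm trains directly on mid-level features."""

    def __init__(self, env: gym.Env, encoder: nn.Module):
        super().__init__(env)
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():   # the perceptual prior stays frozen
            p.requires_grad = False
        # Note: observation_space should also be updated to the feature shape.

    def observation(self, obs: np.ndarray) -> np.ndarray:
        # HWC image -> 1xCxHxW float tensor
        frame = torch.as_tensor(obs, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)
        with torch.no_grad():
            feats = self.encoder(frame)       # e.g. a distance or edge estimator
        return feats.flatten().numpy()        # the feature vector becomes the state
```

Only the policy trained on top of these features receives gradients; the encoder itself stays fixed throughout RL training.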
Compare how well different types of vision enable agents to learn, and whether training performance is indicative of test performance. To evaluate the trained policies, we tested them in environments unseen during training. The following tool lets you compare training and test curves for agents trained on any task using any of the representations--and view them side-by-side against informative controls, such as agents trained from scratch.
The provided policy explorer gives a qualitative overview of what each agent sees and how it behaves. Compare agents that have access to different mid-level features: see sample trajectories, egocentric videos along those trajectories, and readouts of what the agent sees.
Example trajectories. Visualizations of sample trajectories from agents trained with different features.
Learning to Navigate Using Mid-Level Visual Priors. (CoRL '19)
Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies. (arXiv '18)
Sax, Zhang, Emi, Zamir, Guibas, Savarese, Malik.