Mid-Level Visual Representations for Improving Generalization and Sample Efficiency of Visuomotor Policies

[NEW: Nov 2020] See our followup for manipulation tasks and on physical robots at CoRL 2020!

In Conference on Robot Learning (CoRL) 2019 + as an oral at BayLearn 2019

Winner of CVPR19 Habitat Embodied Agent Challenge [RGB Track]

Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir R. Zamir, Leonidas Guibas, Silvio Savarese, Jitendra Malik

Policy Explorer

See what the agent sees. Visualize and compare agents' behavior in unseen buildings.

Explorer Page

Performance Curves

Quantitatively compare agents' performance in new environments.

View Curves


The paper and supplementary material describing the methodology and evaluation.


Overview Video

Brief video summary of the paper, methodology, and results.

See video (YouTube)


View code, install the visualpriors package, and run experiments via Docker.

Get started


How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework--see Figure 1. This skill set (hereafter mid-level perception) provides the policy with a more processed state of the world compared to raw images.

We find that using mid-level perception confers significant advantages over training end-to-end from scratch (i.e. not leveraging priors) in navigation-oriented tasks. Agents are able to generalize to situations where the from-scratch approach fails, and training becomes significantly more sample efficient. However, we show that realizing these gains requires careful selection of the mid-level perceptual skills. This suggests that a set of features, rather than any single feature, could provide better generic perception, and we computationally derive and experimentally validate an example of such a set: the max-coverage feature set, which can be adopted in lieu of raw images. We perform our study using completely separate buildings for training and testing and compare against multiple controls and state-of-the-art feature learning methods.
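As a rough illustration of what "computationally deriving" a covering set can look like, the sketch below picks the k features that minimize the worst-case distance from any feature to its nearest selected one. This is one reading of "max-coverage" and not necessarily the procedure used in the paper. The feature names and distances in `DIST` are made-up placeholders, and `max_coverage_set` is a hypothetical helper that uses brute-force search.

```python
from itertools import combinations


def max_coverage_set(distance, k):
    """Return (subset, cost): the k features whose worst-case distance to any
    other feature is smallest. Brute force, so only suitable for small sets."""
    names = sorted(distance)
    best_subset, best_cost = None, float("inf")
    for subset in combinations(names, k):
        # Cost of a subset = distance from the worst-covered feature
        # to its closest selected feature.
        cost = max(
            min(distance[target][chosen] for chosen in subset)
            for target in names
        )
        if cost < best_cost:  # ties keep the first subset in sorted order
            best_subset, best_cost = subset, cost
    return best_subset, best_cost


# Placeholder symmetric distances between four features (not measured values).
DIST = {
    "depth":   {"depth": 0.0, "normals": 0.2, "edges": 0.7, "objects": 0.9},
    "normals": {"depth": 0.2, "normals": 0.0, "edges": 0.6, "objects": 0.8},
    "edges":   {"depth": 0.7, "normals": 0.6, "edges": 0.0, "objects": 0.5},
    "objects": {"depth": 0.9, "normals": 0.8, "edges": 0.5, "objects": 0.0},
}

single, single_cost = max_coverage_set(DIST, 1)
pair, pair_cost = max_coverage_set(DIST, 2)
print(single, single_cost)
print(pair, pair_cost)
```

With these placeholder distances, allowing two features instead of one lowers the worst-case distance, which is the intuition behind preferring a set over a single fixed feature.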

Click here for a brief video overview.

Mid-level perception in an end-to-end framework for learning visuomotor tasks. We systematically study if/how a set of generic mid-level vision features can help with learning arbitrary downstream visuomotor tasks. We report significant advantages in sample efficiency, generalization, and final performance.
[The above video of the Husky robot is in the Gibson Environment]

Mid-Level Visual Representations

We use a set of standard and imperfect visual estimators (e.g. depth, vanishing points, objects, etc.) and refer to them as mid-level vision objectives. We use the task bank from Taskonomy (CVPR 2018) for this purpose. Frame-by-frame results of the mid-level visual estimators for a sample video are shown below.

Robotic tasks used in the study

We study whether mid-level vision can improve learning downstream (active) tasks, compared both to learning perception from scratch and to state-of-the-art representation learning techniques. We evaluate the efficacy of these approaches based on how quickly the visuomotor task is learned and how well the policies generalize to unseen test spaces. For the mid-level features, we do not care about the objective-specific performance of the visual estimators or their related vision-based metrics, as our sole goal is the downstream task. Three sample robotic tasks were used in the study: visual-target navigation, maximum coverage visual exploration, and visual local planning.

Each column in the figure below shows a representative episode for a policy on a specific task: the two rows show the drastically different behaviors with and without mid-level perception. The advantage of using mid-level perception (top) is clear. All policies are tested in completely different buildings than those seen during training. Head to the policy explorer page to see more examples for any choice of features.

Studied Questions

We distill our analysis into three questions:
HI. Whether mid-level vision improves the learning speed (answer: yes)
HII. Whether mid-level vision provides an advantage when generalizing to unseen spaces (yes)
HIII. Whether a fixed mid-level feature can suffice or if a set of features is required for supporting arbitrary motor tasks (a set is essential).
We use statistical tests to answer these questions where appropriate.
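The page does not say here which statistical tests are used, so the following is only a sketch of the general idea, with a one-sided permutation test swapped in as a simple stand-in. It asks whether agent A's mean test reward is higher than agent B's by more than chance would explain. The reward lists are made-up placeholders, and `permutation_test` is a hypothetical helper.

```python
import random


def permutation_test(a, b, n_resamples=5000, seed=0):
    """One-sided permutation test for mean(a) > mean(b).
    Returns (observed_difference, p_value)."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign episodes to the two agents.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = sum(left) / len(left) - sum(right) / len(right)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_resamples + 1)


# Placeholder per-episode test rewards (not results from the paper).
with_features = [7.1, 6.8, 7.5, 6.9, 7.3, 7.0]
from_scratch = [5.2, 5.9, 4.8, 5.5, 5.1, 5.6]

difference, p_value = permutation_test(with_features, from_scratch)
print(f"mean difference = {difference:.2f}, p = {p_value:.4f}")
```

When many feature/task pairs are compared at once, a correction for multiple comparisons would also be needed; that step is omitted from this sketch.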

The figure below illustrates our approach: Left: Features warp the input distribution, potentially making the train and test distributions look more similar to the agent. Middle: The learned features from fixed encoder networks are used as the state for training policies in RL. Right: Downstream tasks prefer features that contain enough information to solve the task while remaining invariant to the task-irrelevant details.
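The middle panel can be sketched in a few lines: a fixed encoder turns each image into features, and only the policy that consumes those features is trained. Everything below is a toy under stated assumptions. `frozen_encoder` is a hand-written edge detector standing in for a pretrained network, the task (report which side of the image holds an edge) is invented for illustration, and the update rule is plain REINFORCE, not necessarily the RL algorithm used in the study.

```python
import math
import random


def frozen_encoder(image):
    """Hypothetical fixed encoder: mean absolute horizontal gradient in the
    left and right halves of a grayscale image. It has no trainable state."""
    height, width = len(image), len(image[0])
    features = [0.0, 0.0]
    for row in image:
        for x in range(width - 1):
            side = 0 if x < width // 2 else 1
            features[side] += abs(row[x + 1] - row[x]) / height
    return features


def sample_observation(rng):
    """Toy 4x8 image with one vertical step edge; the label is its side."""
    side = rng.randrange(2)
    col = rng.randrange(1, 4) if side == 0 else rng.randrange(5, 8)
    image = [[1.0 if x >= col else 0.0 for x in range(8)] for _ in range(4)]
    return image, side


def action_probs(weights, features):
    """Softmax over linear action scores."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(steps=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [[0.0, 0.0], [0.0, 0.0]]  # the only trainable parameters
    for _ in range(steps):
        image, side = sample_observation(rng)
        state = frozen_encoder(image)  # the policy sees features, not pixels
        probs = action_probs(weights, state)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == side else 0.0
        # REINFORCE: step along reward * grad log pi(action | state).
        for a in range(2):
            grad_logit = (1.0 if a == action else 0.0) - probs[a]
            for i, f in enumerate(state):
                weights[a][i] += lr * reward * grad_logit * f
    return weights


def greedy_accuracy(weights, episodes=200, seed=1):
    rng = random.Random(seed)
    correct = 0
    for _ in range(episodes):
        image, side = sample_observation(rng)
        probs = action_probs(weights, frozen_encoder(image))
        correct += int(probs.index(max(probs)) == side)
    return correct / episodes


policy_weights = train_policy()
print(greedy_accuracy(policy_weights))
```

The encoder here discards the exact edge position and keeps only its side, a small-scale version of features that retain task-relevant information while staying invariant to irrelevant detail.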

Performance Curves

Compare how well different types of vision enable agents to learn, and whether that training performance is indicative of test performance. In order to evaluate the trained policies, we tested them in environments unseen during training. The following tool allows you to compare training and test curves for agents trained for any task using any of the representations--and to view these side-by-side against informative controls, such as agents trained from scratch.

Generalization Gap. We use mid-level vision to reduce the performance gap between training and test environments.
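One simple way to quantify this gap, assuming it is measured as mean training reward minus mean test reward (an assumption; other definitions are possible), is sketched below. The reward values are made-up placeholders, and `generalization_gap` is a hypothetical helper.

```python
def generalization_gap(train_rewards, test_rewards):
    """Mean reward in training buildings minus mean reward in unseen buildings.
    A smaller gap means training performance better predicts test performance."""
    train_mean = sum(train_rewards) / len(train_rewards)
    test_mean = sum(test_rewards) / len(test_rewards)
    return train_mean - test_mean


# Placeholder episode rewards (not results from the paper).
scratch_gap = generalization_gap([8.0, 9.0, 10.0], [4.0, 5.0, 6.0])
features_gap = generalization_gap([8.0, 9.0, 10.0], [7.0, 8.0, 9.0])
print(scratch_gap, features_gap)
```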

Policy Explorer

The provided policy explorer gives a qualitative overview of what each agent sees and how it behaves. Compare agents that have access to different mid-level features: see sample trajectories, egocentric videos along those trajectories, and readouts of what the agent sees.

Example trajectories. Visualizations of sample trajectories from agents using different features.


Learning to Navigate Using Mid-Level Visual Priors. (CoRL '19)

Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies. (Arxiv '18)

Sax, Zhang, Emi, Zamir, Guibas, Savarese, Malik.

 title={Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies.},
 author={Alexander Sax and Bradley Emi and Amir R. Zamir and Leonidas J. Guibas and Silvio Savarese and Jitendra Malik},

BayLearn 2019


Alexander (Sasha) Sax

UC Berkeley, FAIR

Jeffrey O. Zhang

UC Berkeley

Bradley Emi


Amir Zamir

Stanford, UC Berkeley

Leonidas Guibas

Stanford, FAIR

Silvio Savarese


Jitendra Malik

UC Berkeley, FAIR