On Perception for Robotics

Mid-Level Visual Representations

Improve Generalization and Sample Efficiency
for Learning Visuomotor Policies

Winner of CVPR19 Habitat Embodied Agent Challenge [RGB Track]
Alexander Sax, Bradley Emi, Amir R. Zamir
Leonidas Guibas, Silvio Savarese, Jitendra Malik

Policy Explorer

See what the agent sees. Visualize and compare agents' behavior in unseen buildings.

Explorer Page

Performance Curves

Quantitatively compare agents' performance in new environments.

View Curves


The paper and supplementary material describing the methodology and evaluation.


Overview Video

Brief video summary of the paper, methodology, and results.

See video (YouTube)


View code, install the visualpriors package, and run experiments via Docker.

Get started


How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework--see Figure 1. This skill set (hereafter mid-level perception) provides the policy with a more processed state of the world compared to raw images.

We find that using a mid-level perception confers significant advantages over training end-to-end from scratch (i.e. not leveraging priors) in navigation-oriented tasks. Agents are able to generalize to situations where the from-scratch approach fails and training becomes significantly more sample efficient. However, we show that realizing these gains requires careful selection of the mid-level perceptual skills. Therefore, we refine our findings into an efficient max-coverage feature set that can be adopted in lieu of raw images. We perform our study in completely separate buildings for training and testing and compare against visually blind baseline policies and state-of-the-art feature learning methods.
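To make the idea of a small feature set that covers many tasks concrete, here is a minimal sketch of the classical greedy maximum-coverage heuristic. It is an illustration only, not necessarily the selection procedure used in the paper; the function name `greedy_max_coverage`, the feature and task labels, and the coverage table are all hypothetical.

```python
def greedy_max_coverage(covers, k):
    """Pick up to k features that together cover the most tasks.

    covers: dict mapping a feature name to the set of tasks it supports well.
    Returns the chosen features (in pick order) and the set of covered tasks.
    """
    chosen, covered = [], set()
    for _ in range(k):
        # Choose the unused feature that adds the most not-yet-covered tasks.
        best = max(
            (f for f in covers if f not in chosen),
            key=lambda f: len(covers[f] - covered),
            default=None,
        )
        if best is None or not (covers[best] - covered):
            break  # nothing left to gain
        chosen.append(best)
        covered |= covers[best]
    return chosen, covered


# Hypothetical coverage table (placeholder labels, not results from the study).
covers = {
    "feature_a": {"task_1", "task_2"},
    "feature_b": {"task_2"},
    "feature_c": {"task_1"},
    "feature_d": {"task_3"},
}
chosen, covered = greedy_max_coverage(covers, k=2)
print(chosen, sorted(covered))
```

With this placeholder table, two features are enough to cover all three tasks, which is the kind of compact set the paragraph above describes.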

Click here for a brief video overview.

Mid-level perception module in an end-to-end framework for learning active robotic tasks. We systematically study if/how a set of generic mid-level vision features can help with learning arbitrary downstream active tasks. We report significant advantages in sample efficiency and generalization. [the video of the Husky robot is by the Gibson Environment]

Mid-Level Visual Representations

We use a set of standard, imperfect visual estimators (e.g. depth, vanishing points, objects, etc.) and refer to them as mid-level vision tasks. We use the task bank from Taskonomy (CVPR18) for this purpose. Frame-by-frame results of the mid-level visual estimators for a sample video are shown below.

Robotic tasks used in the study

We study whether mid-level vision can improve learning downstream (active) tasks, compared with both learning perception from scratch and state-of-the-art representation learning techniques. We evaluate the efficacy of these approaches based on how quickly the active task is learned and how well the policies generalize to unseen test spaces. For the mid-level features, we do not care about the task-specific performance of the visual estimators or their related vision-based metrics, as our sole goal is the downstream task. Three sample robotic tasks were used in the study: visual-target navigation, maximum coverage visual exploration, and visual local planning.

Each column in the figure below shows a representative episode for a policy on a specific task; the two rows show the drastically different behaviors with and without mid-level perception. The advantage of using mid-level perception (top) is clear. All policies are tested in completely different buildings than those seen during training. For more examples with any choice of features, head to the policy explorer page.

Studied Questions

We distill our analysis into three questions:
HI. Whether mid-level vision improves the learning speed (answer: yes)
HII. Whether mid-level vision provides an advantage when generalizing to unseen spaces (yes)
HIII. Whether a fixed mid-level feature can suffice or if a set of features is required for supporting arbitrary motor tasks (a set is essential).
We use statistical tests to answer these questions where appropriate.
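The page does not say which statistical tests were used. As one possible illustration, the sketch below runs an exact one-sided permutation test on per-episode rewards from two agents. The function name `permutation_p_value` and the reward values are hypothetical and chosen only to show the mechanics.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact one-sided permutation test.

    Returns the fraction of ways to split the pooled samples into groups of
    the original sizes whose mean difference is at least the observed one.
    """
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n = len(pooled)
    hits = total = 0
    for idx in combinations(range(n), len(a)):
        in_a = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in in_a]
        total += 1
        # Small tolerance guards against floating-point round-off.
        if mean(group_a) - mean(group_b) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-episode rewards (not data from the study).
with_features = [0.90, 0.80, 0.85, 0.95, 0.88]
from_scratch = [0.50, 0.55, 0.60, 0.45, 0.52]
p = permutation_p_value(with_features, from_scratch)
print(f"one-sided p-value: {p:.4f}")
```

Enumerating every split is only practical for small samples; with many episodes, a random subset of splits would be sampled instead.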

The figure below illustrates our approach: Left: Features warp the input distribution, potentially making the train and test distributions look more similar to the agent. Middle: The learned features from fixed encoder networks are used as the state for training policies in RL. Right: Downstream tasks prefer features which contain enough information to solve the task while remaining invariant to the changes in the input which are irrelevant for solving the tasks.
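The middle panel's setup, where a fixed encoder produces the state and only the policy is trained, can be sketched in a few lines. This is a toy stand-in under stated assumptions: `edge_features` is a hand-written difference filter rather than a pretrained network, and `LinearPolicy` is a hypothetical placeholder for the actual RL policy.

```python
import random


def edge_features(image):
    """Fixed (non-trainable) encoder: absolute horizontal intensity differences."""
    return [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]


class LinearPolicy:
    """Tiny policy head. In RL training, only these weights would be updated."""

    def __init__(self, n_features, n_actions, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
            for _ in range(n_actions)
        ]

    def act(self, features):
        # Score each action with a dot product and pick the highest-scoring one.
        scores = [sum(w * f for w, f in zip(row, features)) for row in self.weights]
        return max(range(len(scores)), key=scores.__getitem__)


# A 2x4 toy "image" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
features = edge_features(image)  # this, not the raw pixels, is the policy's state
policy = LinearPolicy(n_features=len(features), n_actions=3)
action = policy.act(features)
print(features, action)
```

The point of the design is that the encoder never changes during policy learning, so the policy sees the same kind of processed state in training and test buildings.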

Performance Curves

To evaluate how well the learned policies transfer, we test them in a new environment. Compare how well different types of vision allow agents to generalize, and whether training performance is indicative of test performance. You can view these curves side by side with agents that learn from scratch, as well as with other informative controls such as a blind agent.

Generalization Gap. We use mid-level vision to reduce the performance gap between training and test environments.
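One simple way to quantify this gap, assuming it is measured as mean training reward minus mean test reward, is sketched below. The function name `generalization_gap` and the reward values are hypothetical.

```python
from statistics import mean


def generalization_gap(train_rewards, test_rewards):
    """Mean reward in training buildings minus mean reward in test buildings."""
    return mean(train_rewards) - mean(test_rewards)


# Hypothetical episode rewards (not data from the study).
gap = generalization_gap(train_rewards=[10, 12], test_rewards=[7, 9])
print(gap)
```

A smaller gap means that performance in the training buildings is a better predictor of performance in unseen ones.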

Policy Explorer

The provided policy explorer gives a qualitative overview of what each agent sees and how it behaves. Compare agents that have access to different mid-level features: see sample trajectories, egocentric videos of the trajectories, and readouts of what the agent sees.

Example trajectories. Visualizations of sample trajectories from different features.

What's next

We are currently working on demonstrating the results on a set of terrestrial robots, e.g. Minitaur (shown in the video).


Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies.

Sax, Emi, Zamir, Guibas, Savarese, Malik.

 title={Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies.},
 author={Alexander Sax and Bradley Emi and Amir R. Zamir and Leonidas J. Guibas and Silvio Savarese and Jitendra Malik},



Alexander (Sasha) Sax

UC Berkeley, FAIR

Bradley Emi


Amir Zamir

Stanford, UC Berkeley

Leonidas Guibas

Stanford, FAIR

Silvio Savarese


Jitendra Malik

UC Berkeley, FAIR