On Perception for Robotics

Mid-Level Visual Representations
Improve Generalization and Sample Efficiency
for Learning Active Tasks

Alexander Sax, Bradley Emi, Amir R. Zamir
Leonidas Guibas, Silvio Savarese, Jitendra Malik

Policy Explorer

See what the agent sees. Visualize and compare agents' behavior in unseen buildings.


Performance Curves

Quantitatively compare agents' performance in new environments.



The paper and supplementary material describing the methodology and evaluation.



Frame-aligned visualizations of the different mid-level features.

See videos (coming soon)

Analysis Code

Download code for hypothesis testing in RL.

Download Code (coming soon)


One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving "vision" is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks?

We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch, and that it generalizes in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, we put forth a simple and efficient perception module based on the results of our study, which can be adopted as a fairly generic perception module for active frameworks.

Mid-level perception module in an end-to-end framework for learning active robotic tasks. We systematically study whether and how a set of generic mid-level vision features can help with learning arbitrary downstream active tasks. We report significant advantages in sample efficiency and generalization. [The video of the Husky robot is from the Gibson Environment.]

Mid-Level Visual Representations

We use a set of standard, imperfect visual estimators (e.g. depth, vanishing points, objects) and refer to them as mid-level vision tasks. We use the task bank from Taskonomy (CVPR 2018) for this purpose. Frame-by-frame results of the mid-level visual estimators for a sample video are shown below.

Robotic tasks used in the study

We study whether mid-level vision provides benefits for learning a downstream active task, compared to using no perception module. Our metrics are how quickly the active task is learned and how well the policies generalize to unseen test spaces. We do not measure the task-specific performance of the mid-level visual estimators or their vision-based metrics: our sole goal is the downstream active task, and mid-level vision is only in service of that. Three sample robotic tasks were used in the study: visual navigation to an object, maximum-coverage visual exploration, and visual local planning. Each column shows a sample execution of one task. The two rows show the impact of using a mid-level perception module on performance for a sample episode. The drastic advantage of using a mid-level perception module is apparent. All policies are trained and tested in completely different buildings. You can use the policy explorer page to see more examples for any choice of features.

Studied Questions

We test three core hypotheses:
HI. Whether mid-level vision provides an advantage in sample efficiency when learning an active task (answer: yes).
HII. Whether mid-level vision provides an advantage in generalization to unseen spaces (answer: yes).
HIII. Whether a single fixed mid-level vision feature could suffice, or whether a set of features is essential to support arbitrary active tasks (answer: a set is essential).

We use statistical tests to answer these questions where appropriate. The figure below illustrates the experimental setup for testing our three core hypotheses. Left: Plate-notation view of the transfer learning setup, where internal representations from the encoder network(s) are used as inputs to various RL policies. Right: Illustrations of the hypotheses. Features Φi (also illustrated by the readout images) are ranked by performance on the downstream task. Red lines identify features that have a higher rank for task 1, while blue lines connect features that have a higher rank for task 2. For HI and HII: Some features are ranked significantly above scratch. For HIII: The feature ranking reorders between tasks.
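To make the logic of the hypotheses concrete, here is a minimal sketch of how such comparisons could be set up. It is not the authors' analysis code: the one-sided permutation test is a stand-in chosen for simplicity (the page does not say which statistical tests were used), the function names are hypothetical, and the reward values are toy placeholders rather than results from the study.

```python
import random
from statistics import mean


def permutation_test_greater(a, b, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(a) > mean(b).

    Returns an estimated p-value: the fraction of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_permutations + 1)


def rank_features(scores):
    """Sort feature names from best to worst mean reward."""
    return sorted(scores, key=lambda name: mean(scores[name]), reverse=True)


def ranking_reorders(scores_task1, scores_task2):
    """HIII-style check: does the feature ranking differ between two tasks?"""
    return rank_features(scores_task1) != rank_features(scores_task2)


# Toy placeholder rewards per random seed (NOT results from the study).
task1 = {"feature_a": [10, 11, 12, 13, 14],
         "feature_b": [6, 7, 8, 9, 10],
         "scratch": [1, 2, 3, 4, 5]}
task2 = {"feature_a": [5, 6, 7, 8, 9],
         "feature_b": [10, 11, 12, 13, 14],
         "scratch": [1, 2, 3, 4, 5]}

# HI/HII-style question: is a feature significantly better than scratch?
p_value = permutation_test_greater(task1["feature_a"], task1["scratch"])
print("p-value, feature_a vs. scratch on task 1:", p_value)

# HIII-style question: does the ranking change across tasks?
print("ranking reorders between tasks:", ranking_reorders(task1, task2))
```

For HI the rewards would be measured during training, and for HII in held-out test buildings. With many features, a multiple-comparison correction would also be needed, which this sketch omits.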

Performance Curves

To evaluate the quality of our transfers, we test the agents in a new environment. Compare how well different types of vision allow agents to generalize, and whether training performance is indicative of test performance. You can view these agents side by side with agents that learn from scratch, and against other informative controls such as a blind agent.

Generalization Gap. We use mid-level vision to reduce the performance gap between training and test environments.
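One simple way to quantify this gap is the difference between mean reward in training environments and mean reward in test environments. The sketch below assumes that definition. The function names, the dictionary layout, and the numbers are hypothetical illustrations, not the study's code or results.

```python
from statistics import mean


def generalization_gap(train_rewards, test_rewards):
    """Mean training reward minus mean test reward (smaller is better)."""
    return mean(train_rewards) - mean(test_rewards)


def gaps_by_agent(results):
    """Compute the gap for each agent from {"train": [...], "test": [...]}."""
    return {
        name: generalization_gap(r["train"], r["test"])
        for name, r in results.items()
    }


# Toy placeholder rewards (NOT results from the study).
results = {
    "mid_level_agent": {"train": [10, 12], "test": [9, 11]},
    "scratch_agent": {"train": [10, 12], "test": [2, 4]},
}

for name, gap in gaps_by_agent(results).items():
    print(name, "gap:", gap)
```

A small gap alone does not imply a good agent: a blind agent can have a small gap while performing poorly everywhere, so the gap is best read alongside absolute test reward.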

Policy Explorer

The policy explorer gives a qualitative overview of what each agent sees and how it behaves. Compare agents that have access to different mid-level features: see sample trajectories, egocentric videos of the trajectories, and readouts of what the agent sees.

Example trajectories. Visualizations of sample trajectories from agents using different features.

What's next

We are currently working on demonstrating the results on a set of terrestrial robots, e.g. Minitaur (shown in the video).


Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks.
Sax, Emi, Zamir, Guibas, Savarese, Malik.

 title={Mid-Level Visual Representations Improve Generalization and Sample Efficiency.},
 author={Alexander Sax and Bradley Emi and Amir R. Zamir and Leonidas J. Guibas and Silvio Savarese and Jitendra Malik},


Alexander (Sasha) Sax

UC Berkeley, FAIR

Bradley Emi


Amir Zamir

Stanford, UC Berkeley

Leonidas Guibas

Stanford, FAIR

Silvio Savarese


Jitendra Malik

UC Berkeley, FAIR