One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving "vision" is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks?
We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch, and that it generalizes in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, based on the results of our study, we put forth a simple and efficient perception module that can be adopted as a fairly generic perception module for active frameworks.
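Concretely, the perception module can be thought of as a frozen, pretrained encoder whose output features replace raw pixels as the policy's input. The PyTorch sketch below is a minimal illustration of this setup, not our exact implementation: `MidLevelEncoder` stands in for any pretrained mid-level network (e.g. one trained for depth or surface-normal estimation), and `feature_dim` and the policy head sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MidLevelPolicy(nn.Module):
    """A policy that acts on frozen mid-level features instead of raw pixels."""

    def __init__(self, encoder: nn.Module, feature_dim: int, num_actions: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the perception module: the mid-level features are fixed,
        # so gradients from the RL objective never update the encoder.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.encoder.eval()
        # Only this lightweight head is trained from reward.
        self.policy_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            features = self.encoder(observation)  # mid-level representation
        return self.policy_head(features)  # action logits for the agent
```

In this setup, swapping one mid-level feature for another only means swapping the encoder; the RL algorithm and policy head stay the same, which is what lets us compare features fairly across downstream tasks.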
Mid-level perception module in an end-to-end framework for learning active robotic tasks. We systematically study if/how a set of generic mid-level vision features can help with learning arbitrary downstream active tasks. We report significant advantages in sample efficiency and generalization. [the video of the Husky robot is from the Gibson Environment]
To evaluate the quality of our transfers, we test agents in a new environment. Compare how well different types of vision allow agents to generalize, and whether training performance is indicative of test performance. You can view these agents side-by-side against agents that learn from scratch, as well as against other informative controls such as a blind agent.
The provided policy explorer gives a qualitative overview of what each agent sees and how it behaves. Compare agents that have access to different mid-level features: see sample trajectories, egocentric videos of those trajectories, and readouts of what the agent sees.
Example trajectories. Visualizations of sample trajectories from agents trained with different features.
We are currently working on demonstrating these results on a set of terrestrial robots, e.g. the Minitaur (shown in the video).
Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks.
Sax, Emi, Zamir, Guibas, Savarese, Malik.