Generalization Curve Explorer

We evaluated every policy in both the train and test spaces at various checkpoints throughout training. Below you can compare the performance of any combination of policies to see how well different types of visual features perform on different tasks in the train and test spaces. You can contrast those against different types of controls: Scratch learns "tabula rasa" (i.e., using no mid-level visual features), while Blind blocks all visual information but keeps everything else identical, including reward, action space, etc. (the Blind policy calibrates how much solving the task actually requires visual information). Test curves are shown as dark solid lines, while training curves are lighter and use a dot-dash pattern. The results offer quantitative insight into what types of vision are necessary for different types of active tasks, and they also underscore sample efficiency and the importance of employing an unseen test space so that the evaluations include generalization. You can explore these results qualitatively using the videos here.
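One way to read a pair of curves is sketched below, under stated assumptions. The train-minus-test difference at each checkpoint gives a generalization gap, and the test-space difference between a policy and Blind indicates how much visual information helps on that task. The function names, the list-per-checkpoint layout, and the reward values are hypothetical placeholders chosen for illustration. They are not results from the experiments or part of the explorer itself.

```python
from typing import List


def generalization_gap(train: List[float], test: List[float]) -> List[float]:
    """Per-checkpoint difference between train-space and test-space reward."""
    if len(train) != len(test):
        raise ValueError("train and test curves must share the same checkpoints")
    return [tr - te for tr, te in zip(train, test)]


def margin_over_blind(policy_test: List[float], blind_test: List[float]) -> List[float]:
    """Per-checkpoint test-space advantage of a policy over the Blind control."""
    if len(policy_test) != len(blind_test):
        raise ValueError("curves must share the same checkpoints")
    return [p - b for p, b in zip(policy_test, blind_test)]


# Placeholder curves (made-up numbers, one value per checkpoint).
feature_train = [2.0, 5.0, 8.0]
feature_test = [1.0, 3.0, 4.0]
blind_test = [1.0, 1.0, 2.0]

gap = generalization_gap(feature_train, feature_test)
margin = margin_over_blind(feature_test, blind_test)
print("generalization gap per checkpoint:", gap)
print("test-space margin over Blind:", margin)
```

With these placeholder values, a gap that widens across checkpoints would suggest overfitting to the train space, and a margin near zero would suggest that the task can largely be solved without vision.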