Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks
https://perceptual.actor/
Abstract
One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving “vision” is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks?
We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch and generalizes in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, we put forth a simple and efficient perception module based on the results of our study, which can be adopted as a fairly generic perception module for active frameworks.
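To make this setup concrete, the sketch below (our illustration, not code released with the paper) shows the general pattern of training an active policy on top of frozen mid-level features rather than raw pixels. The MidLevelEncoder class is a hypothetical stand-in: in the actual study the features come from networks pretrained on mid-level vision tasks (e.g. depth or surface-normal estimation), which stay frozen while only the small policy head is trained with reinforcement learning.

```python
import torch
import torch.nn as nn

class MidLevelEncoder(nn.Module):
    """Stand-in for a pretrained mid-level vision network.
    In practice this would load frozen pretrained weights; here it is
    a small CNN so the sketch runs end to end."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, rgb):
        x = self.conv(rgb).flatten(1)
        return self.proj(x)

class PolicyHead(nn.Module):
    """Small policy network trained with RL on top of the frozen features."""
    def __init__(self, feat_dim=128, num_actions=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, feats):
        return self.mlp(feats)

encoder = MidLevelEncoder()
encoder.eval()
for p in encoder.parameters():       # freeze the perception module
    p.requires_grad_(False)

policy = PolicyHead()
obs = torch.rand(1, 3, 84, 84)       # dummy RGB observation
with torch.no_grad():
    feats = encoder(obs)              # mid-level representation
action_logits = policy(feats)         # only the policy receives gradients
```

The design point this illustrates is the one tested in the paper: the visual representation is not learned jointly with the task, so the sample complexity of the RL problem is reduced and the perception module can transfer across spaces.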
We test three core hypotheses:
I. whether mid-level vision provides an advantage in terms of sample efficiency when learning an active task (answer: yes)
II. whether mid-level vision provides an advantage in generalization to unseen spaces (answer: yes)
III. whether a single fixed mid-level vision feature could suffice, or whether a set of features is essential to support arbitrary active tasks (answer: a set is essential).
Hypothesis I: Does mid-level vision provide an advantage in terms of sample efficiency when learning an active task?
Hypothesis II: Do mid-level vision features generalize better to unseen spaces than features learned from scratch?
Hypothesis III: Can a single feature support arbitrary downstream tasks, or is a set of features required?