Bridging the Imitation Gap by Adaptive Insubordination

Luca Weihs*, Unnat Jain*, Jordi Salvador,
Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing
* equal contribution by LW and UJ

University of Illinois at Urbana-Champaign, Allen Institute for AI, University of Washington

arXiv 2020



Why do agents often obtain better reinforcement learning policies when imitating a worse expert? We show that privileged information used by the expert is marginalized in the learned agent policy, resulting in an "imitation gap." Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization skills. To better address these tasks and alleviate the imitation gap we propose 'Adaptive Insubordination' (ADVISOR), which dynamically reweights imitation and reward-based reinforcement learning losses during training, enabling switching between imitation and exploration. On a suite of challenging tasks, we show that ADVISOR outperforms pure imitation, pure reinforcement learning, as well as sequential combinations of these approaches.
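To make the imitation gap concrete: a learner that sees only a partial observation o cannot recover the expert's state-conditioned policy; at best it matches the expert averaged over the states consistent with o. Schematically (the notation here is illustrative; see the paper for the formal statement),

    \pi_{\text{learner}}(a \mid o) \;=\; \sum_{s} P(s \mid o)\, \pi_{\text{expert}}(a \mid s),

so whenever the expert's choices depend on privileged information not determined by o, this marginal policy can be far from optimal.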

Adaptive Insubordination (ADVISOR)

During training, ADVISOR uses an auxiliary actor that judges whether the current observation is better handled by an imitation learning (IL) or a reinforcement learning (RL) loss. To do so, the auxiliary actor attempts to reproduce the expert's action from the perspective of the learning agent at every step. Intuitively, the weight on the IL loss is large when the auxiliary actor can reproduce the expert's action with high confidence, and small otherwise. As we show empirically, ADVISOR combines the benefits of IL and RL while avoiding the pitfalls of either method alone.
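The sketch below shows one way such a per-step reweighting could be wired up in PyTorch. The function name, tensor names, the exponential form of the weight, and the value of alpha are illustrative assumptions; the exact distance measure, loss terms, and hyperparameters follow the paper rather than this snippet.

    # A minimal sketch of adaptively weighting IL and RL losses
    # (names and the exponential weighting are illustrative assumptions,
    # not the paper's exact formulation).
    import torch
    import torch.nn.functional as F

    def advisor_style_loss(main_logits, aux_logits, expert_action,
                           rl_loss_per_step, alpha=4.0):
        # Auxiliary actor is trained purely to imitate the expert (per-step CE).
        aux_ce = F.cross_entropy(aux_logits, expert_action, reduction='none')

        # Per-step weight: near 1 when the auxiliary actor reproduces the
        # expert's action confidently, near 0 otherwise.
        w = torch.exp(-alpha * aux_ce.detach())

        # Main actor mixes imitation and reward-based (RL) objectives per step.
        main_ce = F.cross_entropy(main_logits, expert_action, reduction='none')
        mixed = w * main_ce + (1.0 - w) * rl_loss_per_step
        return mixed.mean() + aux_ce.mean()

In words: steps the auxiliary actor can imitate well (low cross-entropy) are supervised mostly by the expert, while steps it cannot imitate, typically those where the expert relied on privileged information, fall back to the reward-based RL objective.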


Environments

We test ADVISOR on a range of environments. We introduce two didactic environments, 2D-Lighthouse (a grid-world) and Poisoned Doors, which empirically parallel our theoretical understanding of ADVISOR. We also evaluate it on popular tasks within the MiniGrid environment and introduce more challenging variants. In particular, we build upon the WallCrossing and LavaCrossing tasks by injecting natural noise into the visual observations or into the expert's policy. For more details, please see our paper.



References

(1) Luca Weihs*, Unnat Jain*, Jordi Salvador, Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing. Bridging the Imitation Gap by Adaptive Insubordination. arXiv, 2020. [Bibtex]


Acknowledgements

This material is based upon work supported in part by the National Science Foundation under Grants No. 1563727, 1718221, 1637479, 165205, 1703166, Samsung, 3M, Sloan Fellowship, NVIDIA Artificial Intelligence Lab, Allen Institute for AI, Amazon, AWS Research Awards, and Siebel Scholars Award. We thank Iou-Jen Liu, Nan Jiang, and Tanmay Gangwani for feedback on this work.

Website adapted from those of Jingxiang, Richard, and Deepak.