Affiliation:
1. Department of Electrical and Computer Engineering, Seoul National University, Gwanak-gu, South Korea
Abstract
Vision-based reinforcement learning (RL) is a generalizable way to control an agent because it is agnostic to specific hardware configurations. Because visual observations are highly entangled, approaches to vision-based RL rely on scene representations that discern individual entities and establish intuitive physics to constitute the world model. However, most existing works on scene representation learning cannot be successfully deployed to train an RL agent, as they are often highly unstable and fail to hold up over a sufficiently long temporal horizon. We propose ASIMO, a fully unsupervised scene decomposition method for performing interaction-rich tasks with a vision-based RL agent. ASIMO decomposes episode-length agent-object interaction videos into the agent, objects, and background, and predicts their long-term interactions. Further, we explicitly model possible occlusion in the image observations and stably track individual objects. As a result, we can correctly deduce the updated positions of individual entities in response to the agent's action from only partial visual observations. Based on the stable entity-wise decomposition and temporal prediction, we formulate a hierarchical framework to train an RL agent that focuses on the context around the object of interest. We demonstrate that our scene representation can be deployed across different agent configurations to accomplish several tasks that involve pushing, arranging, and placing multiple rigid objects.
Funder
Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant Funded by the Korea Government
Creative-Pioneering Researchers Program Through Seoul National University
National Research Foundation of Korea (NRF) Grant Funded by the Korea Government