ASIMO: Agent-centric scene representation in multi-object manipulation

Published: 2024-06-10
ISSN: 0278-3649
Container-title: The International Journal of Robotics Research
Language: en

Author:

Min Cheol-Hui¹, Kim Young Min¹

Affiliation:

1. Department of Electrical and Computer Engineering, Seoul National University, Gwanak-gu, South Korea

Abstract

Vision-based reinforcement learning (RL) is a generalizable way to control an agent because it is agnostic to specific hardware configurations. As visual observations are highly entangled, attempts at vision-based RL rely on a scene representation that discerns individual entities and establishes intuitive physics to constitute the world model. However, most existing works on scene representation learning cannot successfully be deployed to train an RL agent, as they are often highly unstable and fail to hold up over a sufficiently long temporal horizon. We propose ASIMO, a fully unsupervised scene decomposition method to perform interaction-rich tasks with a vision-based RL agent. ASIMO decomposes agent-object interaction videos of episodic length into the agent, objects, and background, predicting their long-term interactions. Further, we explicitly model possible occlusion in the image observations and stably track individual objects. Then, we can correctly deduce the updated positions of individual entities in response to the agent action from only partial visual observations. Based on the stable entity-wise decomposition and temporal prediction, we formulate a hierarchical framework to train the RL agent that focuses on the context around the object of interest. We demonstrate that our formulation for scene representation can be universally deployed to train different configurations of agents and accomplish several tasks that involve pushing, arranging, and placing multiple rigid objects.

Funder

Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant Funded by the Korea Government

Creative-Pioneering Researchers Program Through Seoul National University

National Research Foundation of Korea (NRF) Grant Funded by the Korea Government

Publisher

SAGE Publications

Link

https://journals.sagepub.com/doi/pdf/10.1177/02783649241257537

Reference: 102 articles.

1. Caron M, Touvron H, Misra I, et al. (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, 10–17 October 2021, 9650–9660.