Abstract
The task of object goal navigation is to drive an embodied agent to the location of a given target using only visual observations. The mapping from visual perception to navigation actions is central to this task. Heterogeneous relationships among observed objects form an essential part of the scene graph and can guide the agent to the target more easily. In this work, we propose a novel Heterogeneous Zone Graph Visual Transformer formulation for graph representation and visual perception. It consists of two key ideas: (1) a Heterogeneous Zone Graph (HZG) that captures heterogeneous target-related zones and their spatial information, allowing the agent to navigate efficiently; and (2) a Relation-wise Transformer Network (RTNet) that maps the relationships between previously observed objects to navigation actions. RTNet extracts rich node and edge features and pays more attention to target-related zones. We model self-attention in the node-to-node encoder and cross-attention in the edge-to-node decoder. We evaluate our method on the AI2THOR dataset and show superior navigation performance. Code and datasets can be found at https://github.com/zhoukang12321/RTNet_VN_2023.
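As a reading aid, the following is a minimal PyTorch sketch of the attention scheme the abstract describes: self-attention over zone-graph node features in the encoder, then cross-attention from edge (relation) features to the encoded nodes in the decoder. All class and parameter names (`RelationWiseSketch`, `node_self_attn`, `edge_node_attn`), dimensions, and layer counts are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class RelationWiseSketch(nn.Module):
    """Hypothetical sketch of RTNet's attention scheme as stated in the
    abstract: node-to-node self-attention (encoder) followed by
    edge-to-node cross-attention (decoder). Dimensions are assumptions."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Encoder side: self-attention over zone-graph node features.
        self.node_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Decoder side: edge features act as queries over encoded nodes.
        self.edge_node_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_nodes = nn.LayerNorm(dim)
        self.norm_edges = nn.LayerNorm(dim)

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, dim) zone/object node features
        # edges: (B, E, dim) heterogeneous relation (edge) features
        h, _ = self.node_self_attn(nodes, nodes, nodes)
        nodes = self.norm_nodes(nodes + h)          # encoded nodes
        h, _ = self.edge_node_attn(edges, nodes, nodes)
        edges = self.norm_edges(edges + h)          # relation-aware features
        return edges                                 # would feed the policy head


if __name__ == "__main__":
    B, N, E, D = 2, 10, 20, 128
    out = RelationWiseSketch()(torch.randn(B, N, D), torch.randn(B, E, D))
    print(out.shape)  # torch.Size([2, 20, 128])
```

In this reading, the cross-attention output is the relation-aware representation that a downstream navigation policy would consume; the actual feature extraction and policy head are in the linked repository.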
Funder
Key Science and Technology Research of Henan Province, China
City University of Hong Kong
Publisher
Springer Science and Business Media LLC