Abstract
The task of object goal navigation is to drive an embodied agent to the location of a given target using only visual observations. The mapping from visual perception to navigation actions is central to this task. Heterogeneous relationships among observed objects form an essential part of the scene graph and can guide the agent to the target more easily. In this work, we propose a novel Heterogeneous Zone Graph Visual Transformer formulation for graph representation and visual perception. It consists of two key ideas: (1) a Heterogeneous Zone Graph (HZG) that captures heterogeneous target-related zones and their spatial information, allowing the agent to navigate efficiently; and (2) a Relation-wise Transformer Network (RTNet) that maps the relationships between previously observed objects to navigation actions. RTNet extracts rich node and edge features and pays more attention to target-related zones. We model self-attention in the node-to-node encoder and cross-attention in the edge-to-node decoder. We evaluate our method on the AI2THOR dataset and show superior navigation performance. Code and datasets can be found at https://github.com/zhoukang12321/RTNet_VN_2023.
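As a reading aid, the following is a minimal PyTorch sketch of the attention scheme the abstract describes: self-attention over zone-graph node features in the encoder, then cross-attention from edge (relation) features to the encoded nodes in the decoder. All class and parameter names (`RelationWiseSketch`, `node_self_attn`, `edge_node_attn`), dimensions, and layer counts are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class RelationWiseSketch(nn.Module):
    """Hypothetical sketch of RTNet's attention scheme as stated in the
    abstract: node-to-node self-attention (encoder) followed by
    edge-to-node cross-attention (decoder). Dimensions are assumptions."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Encoder side: self-attention over zone-graph node features.
        self.node_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Decoder side: edge features act as queries over encoded nodes.
        self.edge_node_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_nodes = nn.LayerNorm(dim)
        self.norm_edges = nn.LayerNorm(dim)

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, dim) zone/object node features
        # edges: (B, E, dim) heterogeneous relation (edge) features
        h, _ = self.node_self_attn(nodes, nodes, nodes)
        nodes = self.norm_nodes(nodes + h)          # encoded nodes
        h, _ = self.edge_node_attn(edges, nodes, nodes)
        edges = self.norm_edges(edges + h)          # relation-aware features
        return edges                                 # would feed the policy head


if __name__ == "__main__":
    B, N, E, D = 2, 10, 20, 128
    out = RelationWiseSketch()(torch.randn(B, N, D), torch.randn(B, E, D))
    print(out.shape)  # torch.Size([2, 20, 128])
```

In this reading, the cross-attention output is the relation-aware representation that a downstream navigation policy would consume; the actual feature extraction and policy head are in the linked repository.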
Funder
Key Science and Technology Research of Henan Province, China
City University of Hong Kong
Publisher
Springer Science and Business Media LLC