Self-Organizing Memory Based on Adaptive Resonance Theory for Vision and Language Navigation
Published: 2023-10-07
Issue: 19
Volume: 11
Page: 4192
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Short-container-title: Mathematics
Author:
Wu Wansen 1, Hu Yue 1, Xu Kai 1, Qin Long 1, Yin Quanjun 1
Affiliation:
1. College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Abstract
Vision and Language Navigation (VLN) is a task in which an agent must understand natural language instructions to reach a target location in a real-scene environment. To improve models' ability for long-horizon planning, emerging research focuses on extending them with different types of memory structures, mainly topological maps or a hidden state vector. However, a fixed-length hidden state vector is often insufficient to capture long-term temporal context, whereas topological maps have been shown to benefit many robotic navigation tasks. We therefore focus on building a feasible and effective topological map representation and using it to improve navigation performance and generalization across seen and unseen environments. This paper presents a Self-organizing Memory based on Adaptive Resonance Theory (SMART) module for incremental topological mapping and a framework that uses the SMART module to guide navigation. Built on fusion adaptive resonance theory networks, the SMART module extracts salient scenes from historical observations and builds a topological map of the environmental layout. It provides a compact spatial representation, supports the discovery of novel shortcuts through inference, and is explainable in terms of cognitive science. Furthermore, given a language instruction and on top of the topological map, we propose a vision–language alignment framework for navigational decision-making. Notably, the framework uses three off-the-shelf pre-trained models to perform landmark extraction, node–landmark matching, and low-level control, without any fine-tuning on human-annotated datasets. We validate our approach on VLN-CE tasks in the Habitat simulator, which provides a photo-realistic environment for an embodied agent with a continuous action space. The experimental results show that our approach achieves performance comparable to the supervised baseline.
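The abstract's core mechanism is a fusion-ART memory that decides, for each observation, whether it resonates with an existing map node or warrants a new one. The sketch below is a minimal illustration of that resonance/vigilance cycle, not the authors' SMART implementation; the class name, channel layout, vigilance values, and feature dimensions are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a fusion-ART-style
# self-organizing memory that incrementally builds a topological map.
import numpy as np


class FusionARTMap:
    def __init__(self, vigilance, choice=0.01, learn_rate=1.0):
        self.rho = np.asarray(vigilance, dtype=float)  # per-channel vigilance rho^k
        self.alpha = choice                            # choice parameter alpha
        self.beta = learn_rate                         # learning rate beta
        self.weights = []                              # per-node list of channel templates
        self.edges = set()                             # links between consecutive nodes
        self.last_node = None                          # node active at the previous step

    @staticmethod
    def _complement_code(x):
        # Complement coding keeps the input norm constant and bounds template shrinkage.
        x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
        return np.concatenate([x, 1.0 - x])

    def observe(self, channels):
        """Process one multi-channel observation (e.g. scene feature + pose);
        return the resonating or newly created node and link it to the previous one."""
        xs = [self._complement_code(c) for c in channels]

        # Choice function per existing node, summed over channels.
        scores = [sum(np.minimum(x, wk).sum() / (self.alpha + wk.sum())
                      for x, wk in zip(xs, w))
                  for w in self.weights]

        # Search nodes by decreasing choice value; accept the first node whose
        # match ratio passes the vigilance test on every channel (resonance).
        winner = None
        for j in np.argsort(scores)[::-1]:
            w = self.weights[j]
            matches = [np.minimum(x, wk).sum() / max(x.sum(), 1e-12)
                       for x, wk in zip(xs, w)]
            if all(m >= r for m, r in zip(matches, self.rho)):
                winner = int(j)
                # Learning: move the template toward the fuzzy AND of input and weight.
                self.weights[j] = [(1 - self.beta) * wk + self.beta * np.minimum(x, wk)
                                   for x, wk in zip(xs, w)]
                break

        if winner is None:
            # No node resonates: recruit a new node whose template is the input itself.
            self.weights.append(list(xs))
            winner = len(self.weights) - 1

        # Topological edge between temporally consecutive salient scenes.
        if self.last_node is not None and self.last_node != winner:
            self.edges.add(tuple(sorted((self.last_node, winner))))
        self.last_node = winner
        return winner


# Usage: two channels (an 8-D visual embedding and a 2-D normalized position).
smart = FusionARTMap(vigilance=[0.7, 0.9])
for step in range(5):
    visual = np.random.rand(8)           # stand-in for a scene feature
    pose = np.array([step / 10, 0.0])    # stand-in for the agent's position
    node = smart.observe([visual, pose])
print(len(smart.weights), "nodes,", len(smart.edges), "edges")
```

Raising the vigilance values makes the memory recruit nodes more readily (a finer map), while lowering them merges more observations into the same node (a coarser, more compact map).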
Funder
National Natural Science Foundation of China; Natural Science Fund of Hunan Province
Subject
General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)