NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Published:2024-03-24 Issue:7 Volume:38 Page:7641-7649
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Zhou Gengze,Hong Yicong,Wu Qi

Abstract

Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing an accurate top-down metric trajectory given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models. Code is available at: https://github.com/GengzeZhou/NavGPT.
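
The per-step loop described in the abstract (textual observation, navigation history, and explorable directions in; a reasoned action out) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked above. Every name here (`StepInput`, `build_prompt`, `parse_action`, `navigate_step`, the `vp_*` viewpoint ids, the prompt wording, and the stubbed `fake_llm`) is a hypothetical chosen for illustration, and the model is any callable that maps a prompt string to a reply string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepInput:
    """Text-only inputs for one navigation step (hypothetical field names)."""
    instruction: str
    observation: str            # textual description of the current view
    history: List[str]          # short summaries of earlier steps
    candidates: Dict[str, str]  # viewpoint id -> description of that direction


def build_prompt(step: StepInput) -> str:
    """Combine the three textual inputs named in the abstract into one prompt."""
    history = "\n".join(f"- {h}" for h in step.history) or "- (none)"
    options = "\n".join(f"{vid}: {desc}" for vid, desc in step.candidates.items())
    return (
        f"Instruction: {step.instruction}\n"
        f"History:\n{history}\n"
        f"Observation: {step.observation}\n"
        f"Explorable directions:\n{options}\n"
        "Reply with 'Thought: <reasoning>' then 'Action: <viewpoint id or STOP>'."
    )


def parse_action(reply: str, candidates: Dict[str, str]) -> str:
    """Extract the chosen action; fall back to STOP if the reply is unusable."""
    for line in reply.splitlines():
        if line.strip().lower().startswith("action:"):
            choice = line.split(":", 1)[1].strip()
            if choice in candidates or choice == "STOP":
                return choice
    return "STOP"


def navigate_step(step: StepInput, llm: Callable[[str], str]) -> str:
    """One zero-shot decision: prompt the model, then parse its chosen action."""
    return parse_action(llm(build_prompt(step)), step.candidates)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: replies follow the format)."""
    return "Thought: The hallway leads toward the kitchen.\nAction: vp_2"


demo_step = StepInput(
    instruction="Walk down the hallway and stop at the kitchen door.",
    observation="A living room with a sofa; a hallway opens to the left.",
    history=[],
    candidates={"vp_1": "a staircase going up", "vp_2": "a hallway to the left"},
)
chosen = navigate_step(demo_step, fake_llm)
```

Keeping the model behind a plain callable lets the prompt assembly and reply parsing be exercised without any API access. The fallback to `STOP` on an unparseable reply is a design choice of this sketch, not something stated in the abstract.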

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Cited by 8 articles.

1. DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments;IEEE Robotics and Automation Letters;2024-09

2. Unlocking Robotic Autonomy: A Survey on the Applications of Foundation Models;International Journal of Control, Automation and Systems;2024-08

3. Unlocking underrepresented use-cases for large language model-driven human-robot task planning;Advanced Robotics;2024-07-04

4. Demo Abstract: Embodied Aerial Agent for City-level Visual Language Navigation Using Large Language Model;2024 23rd ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN);2024-05-13

5. Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions;2024 IEEE International Conference on Robotics and Automation (ICRA);2024-05-13