Affiliation:
1. Computer Science and Technology, Dalian Maritime University, Gaoxin District, Dalian 116026, China
Abstract
The integration of language and vision for object affordance understanding is pivotal for the advancement of embodied agents. Current approaches are often limited by their reliance on separate pre-processing stages for language interpretation and object localization, leading to inefficiencies and error propagation in affordance segmentation. To overcome these limitations, this study introduces a new task: part-level affordance grounding directly from natural language instructions. We present the Instruction-based Affordance Grounding Network (IAG-Net), a novel architecture that unifies language–vision interaction through a varied-scale multimodal attention mechanism. Unlike existing models, IAG-Net employs two textual–visual feature fusion strategies, capturing both sentence-level and task-specific textual features alongside multiscale visual features for precise and efficient affordance prediction. Our evaluation on two newly constructed vision–language affordance datasets, ITT-AFF VL and UMD VL, shows improvements of 11.78% and 0.42% in mean Intersection over Union (mIoU) over cascaded models, respectively, while also increasing processing speed. We release our source code and datasets to the research community to foster further innovation and replication of our findings.
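To illustrate the kind of varied-scale textual–visual fusion the abstract describes, the sketch below attends multiscale visual features to sentence-level and task-specific text embeddings via cross-modal attention and predicts a per-pixel affordance map. This is not the authors' released IAG-Net code; all module names, dimensions, and hyperparameters (e.g. CrossModalFusion, embed_dim=256, three feature scales) are illustrative assumptions.

```python
# Minimal, hypothetical sketch of varied-scale text-visual fusion for
# affordance grounding. Names and sizes are assumptions, not IAG-Net's code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Attend visual tokens at one scale to a small set of text features."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_vis, C); text_tokens: (B, N_txt, C)
        fused, _ = self.attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + fused)  # residual connection


class MultiScaleAffordanceHead(nn.Module):
    """Fuse text with visual features at several scales, then predict a mask."""

    def __init__(self, embed_dim: int = 256, num_scales: int = 3, num_classes: int = 1):
        super().__init__()
        self.fusers = nn.ModuleList(
            [CrossModalFusion(embed_dim) for _ in range(num_scales)]
        )
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, visual_pyramid, sentence_feat, task_feat):
        # sentence_feat, task_feat: (B, C) global text embeddings, stacked as two tokens
        text_tokens = torch.stack([sentence_feat, task_feat], dim=1)  # (B, 2, C)
        fused_maps = []
        for feat, fuser in zip(visual_pyramid, self.fusers):
            b, c, h, w = feat.shape
            tokens = feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
            tokens = fuser(tokens, text_tokens)               # cross-modal attention
            fused_maps.append(tokens.transpose(1, 2).reshape(b, c, h, w))
        # upsample every scale to the finest resolution and merge
        target = fused_maps[0].shape[-2:]
        merged = sum(
            nn.functional.interpolate(m, size=target, mode="bilinear", align_corners=False)
            for m in fused_maps
        )
        return self.classifier(merged)                        # per-pixel affordance logits


if __name__ == "__main__":
    head = MultiScaleAffordanceHead()
    pyramid = [torch.randn(2, 256, s, s) for s in (64, 32, 16)]  # toy multiscale features
    sent, task = torch.randn(2, 256), torch.randn(2, 256)
    print(head(pyramid, sent, task).shape)  # torch.Size([2, 1, 64, 64])
```

In this sketch the two fusion strategies are reduced to a single attention call over a stacked pair of text tokens; the actual model may fuse sentence-level and task-specific features through separate pathways.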