Path-Wise Attention Memory Network for Visual Question Answering

Published: 2022-09-07
Volume: 10
Issue: 18
Page: 3244
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Short-container-title: Mathematics

Author:

Xiang Yingxin, Zhang Chengyuan, Han Zhichao, Yu Hao, Li Jiaye, Zhu Lei

Abstract

Visual question answering (VQA) is regarded as a multi-modal, fine-grained feature fusion task that requires constructing multi-level, omnidirectional relations between nodes. One main solution is the composite attention model, which is composed of co-attention (CA) and self-attention (SA). However, existing composite models only stack single attention blocks and lack both path-wise historical memory and overall adjustment. We propose a path attention memory network (PAM) to construct a more robust composite attention model. After each single-hop attention block (SA or CA), the importance of the cumulative nodes is used to calibrate the signal strength of the nodes' features. Four memorized single-hop attention matrices are used to obtain the path-wise co-attention matrix of path-wise attention (PA); the PA block can therefore synthesize and strengthen the learning effect over the whole path. Moreover, in CA we use guard gates of the target modality to check the values of the source modality, and in SA we use conditioning gates of the other modality to guide the query and key of the current modality. The proposed PAM helps construct a robust multi-hop neighborhood relationship between vision and language and achieves excellent performance on both the VQA2.0 and VQA-CP V2 datasets.
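
The abstract does not say how the four memorized single-hop attention matrices are combined into a path-wise co-attention matrix. The sketch below illustrates one plausible reading only: each single-hop block (SA or CA) yields a row-stochastic attention matrix, and chaining those matrices by matrix product gives a multi-hop attention matrix between vision and language nodes. The toy features, the function names, and the chain-product composition are all assumptions made for illustration; the calibration step, the guard gates, and the conditioning gates are not modeled.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def attention(queries, keys):
    """Scaled dot-product attention matrix; each row sums to 1."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(qv, kv)) / scale for kv in keys])
        for qv in queries
    ]


# Toy node features (hypothetical values): 3 image regions, 2 question words.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language = [[0.5, 0.5], [1.0, -1.0]]

# Four single-hop attention matrices, kept ("memorized") for later reuse.
a_vv = attention(vision, vision)      # SA on vision,          3 x 3
a_vl = attention(vision, language)    # CA vision <- language, 3 x 2
a_ll = attention(language, language)  # SA on language,        2 x 2
a_lv = attention(language, vision)    # CA language <- vision, 2 x 3

# Assumed composition: chain the single hops along each path.
path_vl = matmul(matmul(a_vv, a_vl), a_ll)  # vision <- language, 3 x 2
path_lv = matmul(matmul(a_ll, a_lv), a_vv)  # language <- vision, 2 x 3

# Each vision node gathers language features over the whole path.
attended_language = matmul(path_vl, language)  # 3 x 2
```

Because a product of row-stochastic matrices is itself row-stochastic, each row of `path_vl` and `path_lv` can still be read as a distribution over the nodes of the other modality, which is one way to interpret a "multi-hop neighborhood relationship".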

Funder

National Natural Science Foundation of China

Natural Science Foundation of Hunan Province

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/10/18/3244/pdf

Cited by 1 article.

1. The multi-modal fusion in visual question answering: a review of attention mechanisms;PeerJ Computer Science;2023-05-30