Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis

Authors

Li You, Ding Han, Lin Yuming, Feng Xinyu, Chang Liang

Abstract

Multimodal Aspect-Based Sentiment Analysis (MABSA) is an essential task in sentiment analysis that has garnered considerable attention in recent years. Typical approaches to MABSA utilize cross-modal Transformers to capture interactions between the textual and visual modalities. However, bridging the semantic gap between modality spaces and suppressing interference from irrelevant visual objects at different scales remain challenging. To tackle these limitations, we present the Multi-level Textual-Visual Alignment and Fusion Network (MTVAF), which incorporates three auxiliary tasks. Specifically, MTVAF first transforms multi-level image information into image descriptions, facial descriptions, and optical characters, which are concatenated with the textual input to form a textual+visual input, facilitating comprehensive alignment between the visual and textual modalities. Next, both inputs are fed into an integrated text model that incorporates the relevant visual representations, and dynamic attention mechanisms generate visual prompts that control cross-modal fusion. Finally, we align the probability distributions of the textual input space and the textual+visual input space, effectively reducing noise introduced during the alignment process. Experimental results on two MABSA benchmark datasets demonstrate the effectiveness of the proposed MTVAF, which outperforms state-of-the-art approaches. Our code is available at https://github.com/MKMaS-GUET/MTVAF.
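The pipeline described in the abstract is concrete enough to sketch. Below is a minimal, illustrative PyTorch sketch of two of its steps: building the textual+visual input from multi-level image information, and the final distribution-alignment auxiliary loss. The helper names (describe_image, describe_faces, extract_ocr) are hypothetical stubs standing in for real captioning, facial-description, and OCR models, and the KL-divergence formulation is an assumption, since the abstract does not name the exact alignment objective; the authors' actual implementation is in the linked repository.

# Illustrative sketch (not the authors' code): the helper functions below are
# hypothetical stubs for the captioning, facial-description, and OCR models
# that MTVAF uses to turn an image into text at multiple levels.
import torch
import torch.nn.functional as F

def describe_image(image) -> str:
    return "a person holding a trophy"    # stub for an image-caption model

def describe_faces(image) -> str:
    return "one smiling face"             # stub for a facial-description model

def extract_ocr(image) -> str:
    return "CHAMPION 2023"                # stub for an OCR engine

def build_textual_visual_input(text: str, image) -> str:
    """Concatenate multi-level image information (image description,
    facial description, optical characters) with the original text."""
    return " ".join([text, describe_image(image),
                     describe_faces(image), extract_ocr(image)])

def distribution_alignment_loss(logits_text, logits_text_visual):
    """Pull the textual+visual label distribution toward the text-only one.
    KL divergence is assumed here; the abstract does not name the objective."""
    log_p = F.log_softmax(logits_text_visual, dim=-1)
    q = F.softmax(logits_text, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Toy usage: a batch of 2 examples over 3 sentiment classes.
print(build_textual_visual_input("Great match!", image=None))
loss = distribution_alignment_loss(torch.randn(2, 3), torch.randn(2, 3))
print(float(loss))

In the paper's setup, the two logit tensors would come from the same integrated text model applied to the text-only input and the textual+visual input, respectively.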

Funder

National Natural Science Foundation of China

Innovation Project of GUET Graduate Education

Publisher

Springer Science and Business Media LLC


Cited by

2 articles
