Author
Li You, Ding Han, Lin Yuming, Feng Xinyu, Chang Liang
Abstract
Multimodal Aspect-Based Sentiment Analysis (MABSA) is an essential task in sentiment analysis that has garnered considerable attention in recent years. Typical approaches to MABSA utilize cross-modal Transformers to capture interactions between the textual and visual modalities. However, bridging the semantic gap between the modality spaces and suppressing interference from irrelevant visual objects at different scales remain challenging. To tackle these limitations, we present the Multi-level Textual-Visual Alignment and Fusion Network (MTVAF), which incorporates three auxiliary tasks. Specifically, MTVAF first transforms multi-level image information into image descriptions, facial descriptions, and optical characters. These are then concatenated with the textual input to form a textual+visual input, facilitating comprehensive alignment between the visual and textual modalities. Next, both inputs are fed into an integrated text model that incorporates relevant visual representations, where dynamic attention mechanisms generate visual prompts to control cross-modal fusion. Finally, we align the probability distributions of the textual input space and the textual+visual input space, effectively reducing the noise introduced during the alignment process. Experimental results on two MABSA benchmark datasets demonstrate the effectiveness of the proposed MTVAF and its superior performance compared to state-of-the-art approaches. Our code is available at https://github.com/MKMaS-GUET/MTVAF.
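As a rough illustration of the pipeline the abstract describes, the sketch below shows how multi-level image information, once rendered as text, might be concatenated with the sentence, and how the label distributions of the two input spaces might be aligned with a KL-divergence term. The function names, the separator token, and the choice of KL divergence are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def build_textual_visual_input(text, image_desc, face_desc, ocr_text, sep=" [SEP] "):
    # Concatenate the sentence with the multi-level image information
    # (image description, facial description, optical characters) so that
    # both modalities live in a shared textual space. The separator token
    # and the field ordering are assumptions made for illustration.
    return sep.join([text, image_desc, face_desc, ocr_text])

def distribution_alignment_loss(logits_text, logits_text_visual):
    # Align the label distribution of the textual+visual input with that of
    # the text-only input via KL divergence: a hypothetical rendering of the
    # paper's distribution-alignment objective for reducing visual noise.
    log_p = F.log_softmax(logits_text_visual, dim=-1)  # log-probs, text+visual
    q = F.softmax(logits_text, dim=-1)                 # probs, text only
    return F.kl_div(log_p, q, reduction="batchmean")

# Example usage with dummy logits over three sentiment labels:
logits_t = torch.randn(2, 3)
logits_tv = torch.randn(2, 3)
loss = distribution_alignment_loss(logits_t, logits_tv)

Keeping the text-only distribution as the alignment target lets the textual branch act as a denoising anchor when the generated visual descriptions are irrelevant; whether the paper uses this exact direction of the divergence is an assumption here.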
Funder
National Natural Science Foundation of China
Innovation Project of GUET Graduate Education
Publisher
Springer Science and Business Media LLC
Cited by
2 articles.