Affiliation:
1. School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
Abstract
Research on uni-modal sentiment analysis has achieved great success, but emotion in real life is mostly expressed multi-modally: not only in text but also in images, audio, video, and other forms, and these modalities reinforce one another. If the connections between modalities can be mined, the accuracy of sentiment analysis can be further improved. To this end, this paper introduces MCAM, a cross-attention-based multi-modal fusion model for images and text. First, the ALBERT pre-trained model extracts text features, and BiLSTM then captures textual context; for images, DenseNet121 extracts visual features, and CBAM attends to the emotion-relevant regions. Finally, multi-modal cross-attention fuses the extracted text and image features, and the fused representation is classified to determine emotional polarity. In comparative experiments on the public MVSA and TumEmo datasets, MCAM outperforms the baseline models, with accuracy and F1 scores of 86.5% and 75.3% on MVSA and 85.5% and 76.7% on TumEmo, respectively. In addition, ablation experiments confirm that multi-modal fusion yields better sentiment analysis than any single modality alone.
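The core fusion step the abstract describes can be sketched as scaled dot-product cross-attention, where text token features (e.g., from BiLSTM) act as queries over image region features (e.g., from DenseNet121 after CBAM weighting). This is a minimal NumPy illustration under assumed feature shapes and randomly initialized projection matrices; the paper's actual MCAM architecture and hyperparameters are not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Cross-modal attention: one modality queries the other.
    queries:     (Lq, d) e.g. BiLSTM text token features
    keys_values: (Lk, d) e.g. CBAM-weighted image region features
    Returns an image-aware text representation of shape (Lq, d).
    """
    Q = queries @ Wq
    K = keys_values @ Wk
    V = keys_values @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (Lq, Lk) affinities
    attn = softmax(scores, axis=-1)          # each text token's weights over regions
    return attn @ V

rng = np.random.default_rng(0)
d = 8                                     # toy feature dimension
text_feats = rng.normal(size=(5, d))      # 5 text tokens (assumed)
img_feats = rng.normal(size=(7, d))       # 7 image regions (assumed)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

fused = cross_attention(text_feats, img_feats, Wq, Wk, Wv)
print(fused.shape)  # (5, 8)
```

In a full model this would typically be applied symmetrically (image queries over text as well), with the two fused representations concatenated and passed to the sentiment classifier.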
Funder
Henan Provincial Science and Technology Project
Cited by 1 article.