Mixture of Attention Variants for Modal Fusion in Multi-Modal Sentiment Analysis

Published:2024-01-29 Issue:2 Volume:8 Page:14
ISSN:2504-2289
Container-title:Big Data and Cognitive Computing
language:en
Short-container-title:BDCC

Author:

He Chao¹²,Zhang Xinghua³,Song Dongqing¹,Shen Yingshan²,Mao Chengjie¹,Wen Huosheng⁴,Zhu Dingju⁴,Cai Lihua²⁴

Affiliation:

1. School of Computer Science, South China Normal University, Guangzhou 510631, China

2. Aberdeen Institute of Data Science and Artificial Intelligence, South China Normal University, Guangzhou 528225, China

3. International United College, South China Normal University, Guangzhou 528225, China

4. School of Software, South China Normal University, Guangzhou 528225, China

Abstract

With wider network access and the spread of personal smartphones, multi-modal data, particularly opinionated video messages, has grown explosively, creating urgent demand and substantial opportunities for Multi-Modal Sentiment Analysis (MSA). Deep learning with the attention mechanism underlies most state-of-the-art MSA models because it can learn complex inter- and intra-modal relationships among the modalities embedded in video messages, both temporally and spatially. However, modal fusion remains a major challenge because interactions among different data modalities create a vast feature space. To address this challenge, we propose an MSA algorithm based on deep learning and the attention mechanism, the Mixture of Attention Variants for Modal Fusion (MAVMF). MAVMF is a two-stage process: in stage one, self-attention extracts image and text features, and a bidirectional gated recurrent module captures the dependency relationships in the context of the video discourse; in stage two, four multi-modal attention variants learn the emotional contributions of important features from different modalities. The proposed approach is end-to-end and has been shown to outperform state-of-the-art algorithms when tested on the two largest public datasets, CMU-MOSI and CMU-MOSEI.
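
The attention steps named in the abstract can be illustrated with a minimal sketch. It is not the MAVMF implementation: the abstract does not define the four attention variants, so the code below assumes plain scaled dot-product attention, used once as self-attention within the text modality and once as a generic cross-modal attention in which text queries attend over image features. The bidirectional gated recurrent module is omitted, and the function names, toy feature values, and concatenation-based fusion are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query is scored against every key, the scores are normalised
    with softmax, and the output is the weighted sum of the values.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy features: 3 utterances, 2 dimensions per modality.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

# Self-attention: each text utterance attends over all text utterances.
text_context = attention(text, text, text)

# Cross-modal attention (assumed form): text queries attend over image features.
text_from_image = attention(text, image, image)

# Assumed fusion step: concatenate the two views of each utterance.
fused = [t + c for t, c in zip(text_context, text_from_image)]
```

Each row of `fused` holds four values per utterance: two from the self-attended text features and two from the image features weighted by their relevance to that utterance. A trained model would apply learned projections to the queries, keys, and values before this step.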

Funder

Prof. Dingju Zhu

Publisher

MDPI AG

Link

https://www.mdpi.com/2504-2289/8/2/14/pdf
