Abstract
<abstract>
<p>Multimodal emotion analysis integrates information from several modalities to better understand human emotions. In this paper, we propose Cross-modal Emotion Recognition based on Multi-layer Semantic Fusion (CM-MSF), a model that exploits the complementarity of salient information across modalities and extracts high-level features adaptively. To extract rich features from each modality at different dimensions and depths, we design a parallel deep learning module that processes the modalities individually and aligns the extracted features at low cost. A cascaded cross-modal encoder built from a Bidirectional Long Short-Term Memory (BiLSTM) layer and a one-dimensional convolution (Conv1D) layer then lets the modalities complement one another, which addresses the heterogeneity of the input signals. For flexible, adaptive selection and delivery of information, we design the Mask-gated Fusion Networks (MGF-module), which combines masking with gating structures. Gating vectors control the information flow of each modality, mitigating the low recognition accuracy and emotional misjudgment caused by complex features and noisy, redundant information. We evaluate CM-MSF on the widely used multimodal emotion recognition datasets CMU-MOSI and CMU-MOSEI. The model achieves binary classification accuracies of 89.1% and 88.6% and F1 scores of 87.9% and 88.1% on CMU-MOSI and CMU-MOSEI, respectively. These results support the effectiveness of our approach for recognizing and classifying emotions.</p>
</abstract>
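The gating idea described in the abstract can be illustrated with a minimal sketch. This is not the CM-MSF implementation: the abstract does not specify the form of the gate, so the function name `mask_gated_fusion`, the per-dimension gate parameters `w_a`, `w_b`, and `bias`, and the convex-combination fusion rule are all assumptions made for illustration. The sketch assumes two modalities whose feature sequences are already aligned to the same length and dimension, and a binary mask marking padded time steps.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mask_gated_fusion(x_a, x_b, mask, w_a, w_b, bias):
    """Fuse two aligned feature sequences with an element-wise gate.

    x_a, x_b : lists of T feature vectors, each a list of D floats.
    mask     : list of T values, 1 for a real time step, 0 for padding.
    w_a, w_b, bias : hypothetical per-dimension gate parameters (D floats
        each). A real model would learn these; here they are plain inputs.
    """
    fused = []
    for a_t, b_t, m_t in zip(x_a, x_b, mask):
        step = []
        for a, b, wa, wb, c in zip(a_t, b_t, w_a, w_b, bias):
            # Gate in (0, 1): how much of modality A to let through.
            g = sigmoid(wa * a + wb * b + c)
            # Convex combination of the two modalities, zeroed on padding.
            step.append(m_t * (g * a + (1.0 - g) * b))
        fused.append(step)
    return fused


if __name__ == "__main__":
    audio = [[1.0, 2.0], [3.0, 4.0]]   # illustrative values only
    text = [[3.0, 0.0], [5.0, 6.0]]
    out = mask_gated_fusion(audio, text, mask=[1, 0],
                            w_a=[0.0, 0.0], w_b=[0.0, 0.0], bias=[0.0, 0.0])
    print(out)
```

With all gate parameters set to zero the gate equals 0.5 everywhere, so the first time step is the plain average of the two inputs and the masked second step is zeroed out. A large positive bias pushes the gate toward 1, letting only the first modality through.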
Publisher
American Institute of Mathematical Sciences (AIMS)