A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face
Authors:
Lian Hailun 1,2, Lu Cheng 1,3, Li Sunan 1,2, Zhao Yan 1,2 (ORCID), Tang Chuangao 1,3, Zong Yuan 1,3
Affiliations:
1. Key Laboratory of Child Development and Learning Science (Ministry of Education), Southeast University, Nanjing 210000, China
2. School of Information Science and Engineering, Southeast University, Nanjing 210000, China
3. School of Biological Science and Medical Engineering, Southeast University, Nanjing 210000, China
Abstract
Multimodal emotion recognition (MER) is the identification and understanding of human emotional states by combining signals such as text, speech, and facial cues, and it plays a crucial role in human–computer interaction (HCI). With recent progress in deep learning and the growing availability of multimodal datasets, MER has developed rapidly and produced numerous significant research breakthroughs. However, thorough, focused reviews of these deep learning-based MER achievements remain scarce. This survey bridges that gap by providing a comprehensive overview of recent deep learning-based advances in MER. The paper first analyzes current multimodal datasets, emphasizing their advantages and limitations. It then examines diverse methods for multimodal emotional feature extraction, highlighting the merits and drawbacks of each. Finally, it analyzes MER algorithms in depth, focusing on model-agnostic fusion methods (early fusion, late fusion, and hybrid fusion) and fusion based on the intermediate layers of deep models (simple concatenation fusion, utterance-level interaction fusion, and fine-grained interaction fusion). The strengths and weaknesses of these fusion strategies are assessed to guide researchers in selecting the techniques best suited to their studies. Overall, this survey offers a thorough and insightful review of deep learning-based MER and is intended as a practical guide for researchers advancing this dynamic and impactful field.
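To make the two model-agnostic fusion families named in the abstract concrete, the sketch below illustrates early fusion (concatenating per-modality features before a joint classifier) versus late fusion (averaging per-modality decisions) in PyTorch. This is a minimal illustrative sketch, not code from the survey or the works it reviews; the feature dimensions, the seven-class emotion label space, and the linear classification heads are all assumptions chosen for brevity.

```python
# Minimal, hypothetical sketch of early vs. late fusion for
# speech/text/face emotion recognition. Shapes and layer sizes
# are illustrative assumptions, not taken from the survey.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then classify jointly."""
    def __init__(self, dims=(128, 256, 512), num_emotions=7):
        super().__init__()
        self.classifier = nn.Linear(sum(dims), num_emotions)

    def forward(self, speech, text, face):
        fused = torch.cat([speech, text, face], dim=-1)  # feature-level fusion
        return self.classifier(fused)

class LateFusion(nn.Module):
    """Classify each modality separately, then average the decisions."""
    def __init__(self, dims=(128, 256, 512), num_emotions=7):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, num_emotions) for d in dims])

    def forward(self, speech, text, face):
        logits = [h(x) for h, x in zip(self.heads, (speech, text, face))]
        return torch.stack(logits).mean(dim=0)  # decision-level fusion

speech, text, face = torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 512)
print(EarlyFusion()(speech, text, face).shape)  # torch.Size([4, 7])
print(LateFusion()(speech, text, face).shape)   # torch.Size([4, 7])
```

Hybrid fusion, the third family the abstract mentions, would combine both paths, for example by averaging the early-fusion logits together with the per-modality decisions.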
Funders
National Key R & D Project
Zhishan Young Scholarship of Southeast University
Postdoctoral Scientific Research Foundation of Southeast University
Jiangsu Province Excellent Postdoctoral Program
Subject
General Physics and Astronomy
References
133 articles.
Cited by
12 articles.