Multimodal modeling of human emotions using sound, image and text fusion

Published: 2023-02-20

Author:

Hosseini Seyed Sadegh¹, Yamaghani Mohammad Reza¹, Arabani Soodabeh Poorzaker¹

Affiliation:

1. Islamic Azad University, Lahijan Branch

Abstract

Multimodal emotion recognition and analysis is a developing research field, and improving the multimodal fusion mechanism plays a key role in recognizing emotions more precisely. The present study aimed to optimize the performance of an emotion recognition system and presents a model for multimodal emotion recognition from audio, text, and video data. The data were first fused pairwise, as a combination of video and audio and as a combination of audio and text, and the results of the two pairwise fusions were then fused together, so that the final output drew on audio, text, and video data while taking their common features into account. In the audio-video branch, a convolutional neural network combined with long short-term memory (CNN-LSTM) was used to extract audio features, and the Inception-ResNet-v2 network was applied to extract facial expression features from the video. The fused features were passed through an LSTM to a softmax classifier to recognize emotion from the audio-video fusion. In the audio-text branch, the CNN-LSTM was combined in a dual-channel form to learn audio emotion features, a Bi-LSTM network was used to extract text features, and softmax was used to classify the fused features. Finally, the results of the two branches were fused for the final classification using a logistic regression model. The proposed method achieved a recognition accuracy of 82.9% on the IEMOCAP data set.
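
The final fusion step described in the abstract, in which the softmax outputs of an audio-video branch and an audio-text branch are combined by logistic regression, can be illustrated with a minimal sketch. This is not the authors' implementation: the branch outputs are replaced by synthetic probability vectors, the number of classes (4) and samples (400) are arbitrary, and the names `fit_fusion`, `predict_fusion`, `audio_video_probs`, and `audio_text_probs` are hypothetical. The logistic regression is written out in NumPy and trained by plain gradient descent, which is an assumption made only to keep the example self-contained.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_fusion(features, labels, n_classes, lr=0.5, epochs=500):
    """Fit a multinomial logistic regression by batch gradient descent."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        grad = (probs - onehot) / n          # gradient of mean cross-entropy
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict_fusion(features, W, b):
    # Return the most probable class for each sample.
    return softmax(features @ W + b).argmax(axis=1)

# Synthetic stand-ins for the two branch outputs (hypothetical data, not IEMOCAP).
rng = np.random.default_rng(0)
n_samples, n_classes = 400, 4
labels = rng.integers(0, n_classes, size=n_samples)
onehot = np.eye(n_classes)[labels]
audio_video_probs = softmax(2.0 * onehot + rng.normal(size=(n_samples, n_classes)))
audio_text_probs = softmax(2.0 * onehot + rng.normal(size=(n_samples, n_classes)))

# Decision-level fusion: concatenate both branches' class probabilities
# and let logistic regression learn how to weight them.
fused_input = np.concatenate([audio_video_probs, audio_text_probs], axis=1)
W, b = fit_fusion(fused_input, labels, n_classes)
fused_pred = predict_fusion(fused_input, W, b)
fused_accuracy = float((fused_pred == labels).mean())
```

In this sketch the fusion model sees only the per-class probabilities of each branch, so it learns how much to trust each branch per emotion class. In practice it would be fitted on held-out branch outputs rather than on the same data used for evaluation.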

Publisher

Research Square Platform LLC

References: 40 articles.

1. Baltrusaitis T, Robinson P, Morency L, 3D Constrained Local Model for rigid and non-rigid facial tracking, in: 2012 IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, IEEE (2012), pp. 2610–2617, https://doi.org/10.1109/CVPR.2012.6247980

2. Error weighted semi-coupled hidden Markov model for audio-visual emotion recognition;Lin J-C;IEEE Trans Multimed,2012

3. An appraisal on speech and emotion recognition technologies based on machine learning;Andy C,2020

4. Adaptive Gaussian mixture model-based statistical feature extraction for computer-aided diagnosis of micro-calcification clusters in mammograms;Zhang Z,2020

5. Face recognition from video frames using hidden Markov model classification model based on modified random feature extraction;Vivekanandam B,2019