A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset

Published: 2021-12-30 Issue: 1 Volume: 12 Page: 327
ISSN: 2076-3417
Container-title: Applied Sciences
language: en
Short-container-title: Applied Sciences

Author:

Luna-Jiménez Cristina, Kleinlein Ricardo, Griol David, Callejas Zoraida, Montero Juan M., Fernández-Martínez Fernando

Abstract

Emotion recognition is attracting the attention of the research community because of its many applications in fields such as medicine and autonomous driving. In this paper, we propose an automatic emotion recognition system consisting of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned the whole model with a multilayer perceptron appended on top of it, confirming that training was more robust when it did not start from scratch and the network's prior knowledge was similar to the target task. For the FER, we extracted the Action Units from the videos and compared the performance of static models against sequential models. Sequential models outperformed static models by a narrow margin. Error analysis indicated that the visual systems could improve with a detector of frames with high emotional load, which opens a new line of research into ways of learning from videos. Finally, combining the two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset in a subject-wise 5-fold cross-validation (5-CV) evaluation, classifying eight emotions. The results demonstrate that both modalities carry relevant information for detecting users’ emotional state and that their combination improves the final system performance.
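The late fusion step mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which fusion rule the authors used, so the weighted average of per-class posteriors below is an assumption, as are the function names `late_fusion` and `predict`, the `speech_weight` parameter, and the order of the eight emotion labels.

```python
from __future__ import annotations

from typing import Sequence

# The eight RAVDESS emotion classes. The label order is an assumption
# made for this sketch, not taken from the paper.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]


def late_fusion(speech_probs: Sequence[float],
                face_probs: Sequence[float],
                speech_weight: float = 0.5) -> list[float]:
    """Combine per-class posteriors from two models by weighted averaging.

    This is one common late fusion rule, assumed here for illustration.
    """
    if len(speech_probs) != len(face_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    face_weight = 1.0 - speech_weight
    return [speech_weight * s + face_weight * f
            for s, f in zip(speech_probs, face_probs)]


def predict(fused_probs: Sequence[float]) -> str:
    """Return the emotion label with the highest fused score."""
    best_index = max(range(len(fused_probs)), key=lambda i: fused_probs[i])
    return EMOTIONS[best_index]


# Hypothetical posteriors for one clip (made-up numbers, not paper results).
speech = [0.05, 0.05, 0.60, 0.05, 0.10, 0.05, 0.05, 0.05]
face = [0.10, 0.10, 0.20, 0.05, 0.40, 0.05, 0.05, 0.05]
print(predict(late_fusion(speech, face)))
```

Because each modality is scored independently before the combination, either recognizer can be retrained or replaced without touching the other. That is the usual motivation for late fusion over feature-level fusion.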

Funder

Ministry of Economy, Industry and Competitiveness

Ministerio de Educación Cultura y Deporte

European Commission

Agencia Estatal de Investigación

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/12/1/327/pdf

Reference: 76 articles.

1. The Role of Trust in Proactive Conversational Assistants

2. Embodied Conversational Agents; Cassell, 2000

3. From ‘automation’ to ‘autonomy’: the importance of trust repair in human–machine interaction

4. Driver Emotion Recognition for Intelligent Vehicles

5. An Ambient Intelligence-Based Human Behavior Monitoring Framework for Ubiquitous Environments

Cited by 42 articles.

1. Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review; Applied Sciences; 2024-09-09

2. Optimized efficient attention-based network for facial expressions analysis in neurological health care; Computers in Biology and Medicine; 2024-09

3. Speech emotion recognition using the novel SwinEmoNet (Shifted Window Transformer Emotion Network); International Journal of Speech Technology; 2024-07-10

4. Using transformers for multimodal emotion recognition: Taxonomies and state of the art review; Engineering Applications of Artificial Intelligence; 2024-07

5. Breaking the Silence: Whisper-Driven Emotion Recognition in AI Mental Support Models; 2024 IEEE Conference on Artificial Intelligence (CAI); 2024-06-25