Joint modelling of audio-visual cues using attention mechanisms for emotion recognition

Authors:

Ghaleb Esam, Niehues Jan, Asteriadis Stylianos

Abstract

Emotions, with their complex socio-psychological nature, play a crucial role in human-human communication. To enhance emotion communication in human-computer interaction, this paper studies emotion recognition from the audio and visual signals of video clips, utilizing facial expressions and vocal utterances. The study aims to exploit the temporal information of audio-visual cues and to detect their informative time segments, using attention mechanisms to weigh the importance of each modality over time. We propose a novel framework built on bi-modal time windows spanning short video clips labeled with discrete emotions. The framework employs two networks, each dedicated to one modality. As input to each modality-specific network, we consider a time-dependent signal derived from the embeddings of the corresponding modality: the encoder part of the Transformer is applied to the visual embeddings, and another encoder is applied to the audio embeddings. The paper presents detailed studies and meta-analysis findings that link the outputs of our approach to research from psychology. Specifically, it offers a framework for understanding the underlying principles of emotion recognition under three modality setups: audio only, video only, and the fusion of audio and video. Experimental results on two datasets show that the proposed framework achieves improved emotion recognition accuracy compared to state-of-the-art techniques and to baseline methods without attention mechanisms, outperforming the baselines by at least 5.4%. Our experiments show that attention mechanisms reduce the gap between the entropies of the unimodal predictions, which increases the certainty of the bimodal predictions and therefore improves the bimodal recognition rates. Furthermore, evaluations with noisy data in different scenarios during the training and testing processes assess the framework's consistency and the attention mechanism's behavior; the results demonstrate that attention mechanisms increase the framework's robustness when it is exposed to similar conditions in the training and testing phases. Finally, we present comprehensive evaluations of emotion recognition as a function of time. The study shows that the middle time segments of a video clip are essential when the audio modality is used, whereas for the video modality the importance is distributed equally across time windows.
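
To make the described architecture concrete, below is a minimal PyTorch-style sketch of a bi-modal framework with one Transformer encoder per modality and attention pooling over the time windows. It is an illustrative reading of the abstract, not the authors' implementation; all names and dimensions (ModalityEncoder, BiModalEmotionNet, audio_dim=128, video_dim=512, seven emotion classes) are assumptions.

import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    # Transformer encoder over a sequence of per-window embeddings,
    # followed by attention pooling across time (illustrative sketch).
    def __init__(self, dim, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.attn_score = nn.Linear(dim, 1)   # scores each time window

    def forward(self, x):                     # x: (batch, windows, dim)
        h = self.encoder(x)
        w = torch.softmax(self.attn_score(h), dim=1)  # attention over time
        return (w * h).sum(dim=1), w          # pooled feature + weights

class BiModalEmotionNet(nn.Module):
    # Two modality-specific networks whose pooled features are fused
    # and classified into discrete emotions (dimensions are assumed).
    def __init__(self, audio_dim=128, video_dim=512, n_emotions=7):
        super().__init__()
        self.audio = ModalityEncoder(audio_dim)
        self.video = ModalityEncoder(video_dim)
        self.classifier = nn.Linear(audio_dim + video_dim, n_emotions)

    def forward(self, audio_seq, video_seq):
        a, a_w = self.audio(audio_seq)        # audio_seq: (B, T, audio_dim)
        v, v_w = self.video(video_seq)        # video_seq: (B, T, video_dim)
        logits = self.classifier(torch.cat([a, v], dim=-1))
        return logits, (a_w, v_w)             # weights mark informative segments

Under this reading, the per-window attention weights (a_w, v_w) are what would expose the informative time segments discussed in the abstract, such as the emphasis on the middle segments of a clip when only audio is used.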

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications, Hardware and Architecture, Media Technology, Software


Cited by 10 articles.