Emotion Recognition from Videos Using Multimodal Large Language Models

Published: 2024-07-13 Issue: 7 Volume: 16 Page: 247
ISSN: 1999-5903
Container-title: Future Internet
Language: en
Short-container-title: Future Internet

Author:

Vaiani Lorenzo¹, Cagliero Luca¹, Garza Paolo¹

Affiliation:

1. Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy

Abstract

The spread of Multimodal Large Language Models (MLLMs) has opened new research directions in video content understanding and classification. Emotion recognition from videos aims to automatically detect human emotions such as anxiety and fear. It requires in-depth processing of multiple data modalities, including acoustic and visual streams. State-of-the-art approaches leverage transformer-based architectures to combine multimodal sources. However, the impressive performance of MLLMs in content retrieval and generation offers new opportunities to extend the capabilities of existing emotion recognizers. This paper explores the performance of MLLMs on the emotion recognition task in a zero-shot learning setting. Furthermore, it presents an extension of a state-of-the-art architecture based on MLLM content reformulation. The results on the Hume-Reaction benchmark show that MLLMs still fall short of the state-of-the-art average performance but, notably, are more effective than traditional transformers at recognizing emotions whose intensity deviates from the sample average.

Publisher

MDPI AG

Link

https://www.mdpi.com/1999-5903/16/7/247/pdf
