A Neural Network Architecture for Children’s Audio–Visual Emotion Recognition

Published:2023-11-07 Issue:22 Volume:11 Page:4573
ISSN:2227-7390
Container-title:Mathematics
language:en
Short-container-title:Mathematics

Author:

Matveev Anton¹, Matveev Yuri¹, Frolova Olga¹, Nikolaev Aleksandr¹, Lyakso Elena¹

Affiliation:

1. Child Speech Research Group, Department of Higher Nervous Activity and Psychophysiology, St. Petersburg University, St. Petersburg 199034, Russia

Abstract

Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio–visual speech. In this work, we investigate the automatic classification of the audio–visual emotional speech of children, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio–visual ER systems. We first present a new corpus of children’s audio–visual emotional speech that we collected. Then, we propose a neural network solution that improves the utilization of the temporal relationships between the audio and video modalities in the cross-modal fusion for children’s audio–visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of the cross-modal temporal relationships using attention. In experiments comparing our proposed approach with the selected baseline model, we observe a relative performance improvement of 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child–machine communication and for environments where qualified professionals work with children.

Funder

Russian Science Foundation

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/11/22/4573/pdf
