Recognition of Emotions in Speech Using Convolutional Neural Networks on Different Datasets

Published:2022-11-21 Issue:22 Volume:11 Page:3831
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Zielonka Marta, Piastowski Artur, Czyżewski Andrzej, Nadachowski Paweł, Operlejn Maksymilian, Kaczor Kamil

Abstract

Artificial Neural Network (ANN) models, specifically Convolutional Neural Networks (CNNs), were applied to recognize emotions in speech from spectrograms and mel-spectrograms. The study compares the two representations to determine which better captures emotion and how large the difference in classification efficiency is. The experiments showed that mel-spectrograms are the better-suited input for training CNN-based speech emotion recognition (SER) models. Five popular datasets were used: the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Surrey Audio-Visual Expressed Emotion (SAVEE), the Toronto Emotional Speech Set (TESS), and the Interactive Emotional Dyadic Motion Capture (IEMOCAP). Six classes of emotions were used: happiness, anger, sadness, fear, disgust, and neutral. Some experiments, however, covered only four emotions because of the characteristics of the IEMOCAP dataset. Classification efficiency was also compared across datasets, and an attempt was made to develop a universal model trained on all of them; this model reached an accuracy of 55.89% when recognizing four emotions. The most accurate six-emotion model was trained on a combination of four datasets (CREMA-D, RAVDESS, SAVEE, TESS) and achieved 57.42% accuracy. A further study demonstrated that improper division of data into training and test sets significantly influences the test accuracy of CNNs. This problem, which affected the results of studies known from the literature, is therefore addressed extensively. These experiments also used the popular ResNet18 architecture to demonstrate the reliability of the results and to show that the problem is not unique to the custom CNN architecture proposed in the experiments. Finally, the label correctness of the CREMA-D dataset was examined using a questionnaire prepared for this purpose.
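
The abstract does not say what made the data division improper in the affected studies. One common cause in SER work is that clips from the same speaker land in both the training and the test set, which can inflate test accuracy. The sketch below illustrates a speaker-disjoint split under that assumption; it is not the authors' procedure. The function name `speaker_disjoint_split`, the `(speaker_id, clip)` tuple layout, and the file names are hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (speaker_id, clip) pairs so that no speaker appears in both sets.

    Assumption: every sample carries a speaker identifier. The split is made
    over speakers, not over individual clips.
    """
    # Group clips by speaker so whole speakers can be assigned to one side.
    by_speaker = defaultdict(list)
    for speaker_id, clip in samples:
        by_speaker[speaker_id].append(clip)

    # Shuffle the speaker list reproducibly and reserve a fraction for testing.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [(s, c) for s, c in samples if s not in test_speakers]
    test = [(s, c) for s, c in samples if s in test_speakers]
    return train, test


if __name__ == "__main__":
    # Hypothetical toy corpus: 10 speakers with 5 clips each.
    corpus = [(f"spk{i}", f"spk{i}_clip{j}.wav") for i in range(10) for j in range(5)]
    train_set, test_set = speaker_disjoint_split(corpus)
    print(len(train_set), len(test_set))
```

Any augmentation (for example, noise or pitch shifting) would then be applied after the split and only to the training portion, so that augmented copies of a test clip cannot leak into training.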

Funder

Gdansk University of Technology. Internal

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering,Computer Networks and Communications,Hardware and Architecture,Signal Processing,Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/11/22/3831/pdf

Cited by 12 articles.

1. Utilizing robots for voice and sound analysis in therapy: enhancing emotional understanding in children with autism spectrum disorders;Journal of Modern Science;2024-08-20

2. Bark frequency cepstral coefficient based sadness emotion level recognition system;Computer Methods in Biomechanics and Biomedical Engineering;2024-06-19

3. Enhancing speech emotion recognition through deep learning and handcrafted feature fusion;Applied Acoustics;2024-06

4. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases;Journal of Systems Science and Systems Engineering;2024-05-29

5. Family Interactive Device with AI Emotion Sensing Technology;2024 IEEE 4th International Conference on Electronic Communications, Internet of Things and Big Data (ICEIB);2024-04-19