Speech Emotion Recognition Incorporating Relative Difficulty and Labeling Reliability
Authors:
Ahn Youngdo¹, Han Sangwook¹, Lee Seonggyu¹, Shin Jong Won¹
Affiliation:
1. School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Buk-gu, Gwangju 61005, Republic of Korea
Abstract
Emotions in speech are expressed in various ways, and a speech emotion recognition (SER) model may perform poorly on unseen corpora whose emotional characteristics differ from those of the training databases. Regularization approaches and metric losses have been studied to build SER models that are robust to unseen corpora. In this paper, we propose an SER method that incorporates the relative difficulty and labeling reliability of each training sample. Inspired by the Proxy-Anchor loss, we propose a novel loss function that assigns larger gradients to the samples whose emotion labels are harder to estimate within the given minibatch. However, annotators may label an utterance based on emotional cues that reside in the conversational context or in another modality but are not apparent in the speech signal itself; such labels may be unreliable, and they affect the proposed loss function more severely. We therefore propose to apply label smoothing to the samples misclassified by a pre-trained SER model. Experimental results showed that adopting the proposed loss function together with label smoothing on the misclassified data improved SER performance on unseen corpora.
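Since the abstract names the Proxy-Anchor loss (Kim et al., 2020) only as its starting point, a minimal PyTorch sketch of that published loss may help illustrate the difficulty-weighting property it relies on: the log-sum-exp over sample-proxy similarities naturally gives larger gradients to the harder samples in the minibatch. The function name, tensor shapes, and hyperparameter values below are illustrative assumptions, not the authors' implementation of their modified loss.

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-Anchor loss (Kim et al., 2020) -- a sketch, not the paper's variant.

    The log-sum-exp over cosine similarities assigns larger gradients to
    harder samples within the minibatch, which is the property the
    proposed loss builds on to weight samples by relative difficulty.
    """
    # Cosine similarity between every sample and every class proxy: (B, C).
    sim = F.linear(F.normalize(embeddings), F.normalize(proxies))

    num_classes = proxies.size(0)
    one_hot = F.one_hot(labels, num_classes).float()  # (B, C)

    pos_exp = torch.exp(-alpha * (sim - delta))
    neg_exp = torch.exp(alpha * (sim + delta))

    # Proxies with at least one positive sample in the current batch.
    with_pos = one_hot.sum(dim=0) > 0

    pos_term = torch.log(1 + (pos_exp * one_hot).sum(dim=0))[with_pos].sum()
    neg_term = torch.log(1 + (neg_exp * (1 - one_hot)).sum(dim=0)).sum()

    return pos_term / with_pos.sum() + neg_term / num_classes
```

Here `proxies` would typically be a learnable `nn.Parameter` of shape `(num_classes, embedding_dim)`, one anchor per emotion class. Likewise, a hedged sketch of the label-smoothing step the abstract describes, assuming a separately pre-trained SER model supplies the logits used to flag unreliable labels; the smoothing strength `eps` and all names are hypothetical:

```python
def smooth_misclassified(labels, pretrained_logits, num_classes, eps=0.1):
    """Smooth the labels only of samples a pre-trained SER model misclassifies,
    treating their hard labels as less reliable; correctly classified samples
    keep their one-hot targets.
    """
    one_hot = F.one_hot(labels, num_classes).float()
    misclassified = pretrained_logits.argmax(dim=1) != labels  # (B,)
    smoothed = (1 - eps) * one_hot + eps / num_classes
    # Broadcast the per-sample mask over the class dimension.
    return torch.where(misclassified.unsqueeze(1), smoothed, one_hot)
```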