Abstract
Speech emotion recognition (SER), a rapidly evolving task that aims to recognize the emotion of speakers, has become a key research area in affective computing. However, various languages in multilingual natural scenarios extremely challenge the generalization ability of SER, causing the model performance to decrease quickly, and driving researchers to ask how to improve the performance of multilingual SER. Recent studies mainly use feature fusion and language-controlled models to address this challenge, but key points such as the intrinsic association of languages or deep analysis of multilingual shared features (MSFs) are still neglected. To solve this problem, an explainable Multitask-based Shared Feature Learning (MSFL) model is proposed for multilingual SER. The introduction of multi-task learning (MTL) can provide related task information of language recognition for MSFL, improve its generalization in multilingual situations, and further lay the foundation for learning MSFs. Specifically, considering the generalization capability and interpretability of the model, the powerful MTL module was combined with the long short-term memory and attention mechanism, aiming to maintain the generalization in multilingual situations. Then, the feature weights acquired from the attention mechanism were ranked in descending order, and the top-ranked MSFs were compared with top-ranked monolingual features, enhancing the model interpretability based on the feature comparison. Various experiments were conducted on Emo-DB, CASIA, and SAVEE corpora from the model generalization and interpretability aspects. Experimental results indicate that MSFL performs better than most state-of-the-art models, with an average improvement of 3.37–4.49%. Besides, the top 10 features in MSFs almost contain the top-ranked features in three monolingual features, which effectively demonstrates the interpretability of MSFL.
Funder
Chinese National Social Science Foundation
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
Reference77 articles.
1. Dellaert, F., Polzin, T., and Waibel, A. (1996, January 3–6). Recognizing Emotion in Speech. Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP ’96, Philadelphia, PA, USA.
2. Classifying Emotions and Engagement in Online Learning Based on a Single Facial Expression Recognition Neural Network;Savchenko;IEEE Trans. Affect. Comput.,2022
3. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer;Raffel;J. Mach. Learn. Research.,2022
4. EEG-Based Emotion Recognition Using Regularized Graph Neural Networks;Zhong;IEEE Trans. Affect. Comput.,2022
5. Dimensional Speech Emotion Recognition Review;Li;Ruan Jian Xue Bao/J. Softw.,2020
Cited by
3 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献