MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers

Published:2024-08-25 Issue:17 Volume:24 Page:5506
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Li Hui¹,², Li Jiawen², Liu Hai², Liu Tingting², Chen Qiang², You Xinge¹

Affiliation:

1. School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China

2. National Engineering Research Center for E-Learning, Central China Normal University, Wuhan 430079, China

Abstract

Speech emotion recognition (SER) is both a ubiquitous part of everyday communication and a central focus of human–computer interaction research. However, SER remains challenging: subtle emotional nuances are hard to detect, and recognizing speech emotions in noisy environments is difficult. To address these challenges, we introduce MelTrans, a Transformer-based model designed to distill critical clues from speech data by learning core features and long-range dependencies. At the heart of our approach is a dual-stream framework. Built on the Transformer architecture, MelTrans models broad dependencies within speech mel-spectrograms, enabling a nuanced reading of the emotional cues embedded in speech signals. Experimental evaluations on the EmoDB (92.52%) and IEMOCAP (76.54%) datasets demonstrate the effectiveness of MelTrans. These results highlight its ability to capture critical cues and long-range dependencies in speech data, setting a new benchmark on these two datasets, and show that the model addresses the complex challenges posed by SER tasks.
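The abstract names mel-spectrograms as the model input but does not give the front-end settings. As a rough illustration of what "mel" means here, the sketch below places band centers at equal steps on the mel scale using the common HTK-style formula. The formula choice, the function names, and the values n_mels=8 and 0 to 8000 Hz are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel with the HTK-style formula (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels bands equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


# Illustrative values only; the paper's actual settings are not stated in the abstract.
centers = mel_band_centers(n_mels=8, f_min=0.0, f_max=8000.0)
for i, c in enumerate(centers):
    print(f"band {i}: {c:.1f} Hz")
```

Because the spacing is uniform in mel rather than in Hz, the bands are dense at low frequencies and progressively wider at high frequencies, which is the property a mel-spectrogram front end relies on.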

Funder

National Natural Science Foundation of Hubei Province project

Jiangxi Provincial Natural Science Foundation

University Teaching Reform Research Project of Jiangxi Province

Shenzhen Science and Technology Program

Publisher

MDPI AG

Link

https://www.mdpi.com/1424-8220/24/17/5506/pdf

Reference: 46 articles.

1. User Representations in Human-Computer Interaction;Seinfeld;Hum.-Comput. Interact.,2020

2. Semi-supervised cross-lingual speech emotion recognition;Agarla;Expert Syst. Appl.,2024

3. Gao, R., and Grauman, K. (2021, January 20–25). VisualVoice: Audio-visual speech separation with cross-modal consistency. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.

4. Acoustic feature selection for automatic emotion recognition from speech;Rong;Inf. Process. Manag.,2009

5. Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels;Wu;IEEE Trans. Affect. Comput.,2010