Speech Emotion Recognition under Noisy Environments with SNR Down to −6 dB Using Multi-Decoder Wave-U-Net
Published: 2024-06-17
Journal: Applied Sciences
Volume: 14, Issue: 12, Page: 5227
ISSN: 2076-3417
Language: en
Authors:
Nam Hyun-Joon (1), Park Hong-June (1)
Affiliation:
1. Department of Electronic and Electrical Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea
Abstract
A speech emotion recognition (SER) model for noisy environments is proposed that uses four band-pass filtered speech waveforms as the model input instead of simplified input features such as MFCCs (Mel frequency cepstral coefficients). The four waveforms retain the entire information of the original noisy speech, whereas the simplified features keep only partial information; this information reduction at the model input can degrade accuracy in noisy environments. A normalized loss function is used during training to preserve the high-frequency details of the original noisy speech waveform. A multi-decoder Wave-U-Net performs the denoising operation, and its output waveform is applied to an emotion classifier. By this arrangement, the number of parameters is reduced from 4.2 M used for training to 2.8 M for inference. The Wave-U-Net consists of an encoder, a two-layer LSTM, six decoders, and skip-nets; of the six decoders, four denoise the four band-pass filtered waveforms, one denoises the pitch-related waveform, and one generates the emotion classifier input waveform. This work shows much less accuracy degradation under noisy environments than other SER works: relative to the accuracy on the clean speech waveform, the degradation is 3.8% at 0 dB SNR in this work, while it exceeds 15% in the other SER works. The accuracy degradation of this work at SNRs of 0 dB, −3 dB, and −6 dB is 3.8%, 5.2%, and 7.2%, respectively.
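As a concrete illustration of the four-band front end, the sketch below splits a noisy waveform into four band-pass filtered copies. The Butterworth design, filter order, sampling rate, and band edges are illustrative assumptions; the abstract does not specify them.

```python
# A minimal sketch of the four-band front end. The cutoff frequencies and
# filter design below are placeholders, not the values used in the paper.
import numpy as np
from scipy.signal import butter, sosfiltfilt

SAMPLE_RATE = 16_000  # assumed sampling rate

# Hypothetical band edges in Hz.
BANDS = [(50, 500), (500, 1000), (1000, 2000), (2000, 4000)]

def split_into_bands(speech: np.ndarray) -> np.ndarray:
    """Return a (4, T) array of band-pass filtered copies of the waveform."""
    outputs = []
    for low, high in BANDS:
        sos = butter(4, [low, high], btype="bandpass",
                     fs=SAMPLE_RATE, output="sos")
        outputs.append(sosfiltfilt(sos, speech))
    return np.stack(outputs)
```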
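The abstract mentions a normalized loss that preserves high-frequency detail but does not give its form. One plausible reading, shown here purely as an assumption, is an L1 reconstruction loss scaled by the target's mean absolute amplitude, so that quiet high-frequency bands are not dominated by louder low-frequency bands.

```python
import torch

def normalized_l1_loss(estimate: torch.Tensor,
                       target: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """L1 loss divided by the target's mean absolute amplitude.

    Hypothetical formulation: per-band normalization keeps low-energy
    (often high-frequency) targets on an equal footing in the loss.
    """
    scale = target.abs().mean(dim=-1, keepdim=True) + eps
    return ((estimate - target).abs() / scale).mean()
```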
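The described architecture (a shared encoder, a two-layer LSTM bottleneck, and six decoders with skip connections) can be sketched as below. Depth, channel widths, and kernel sizes are placeholders, and plain concatenation stands in for the paper's skip-nets; at inference, only the decoder producing the classifier input waveform would be retained, consistent with the reported parameter reduction from 4.2 M to 2.8 M.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDecoderWaveUNet(nn.Module):
    """Shared encoder + 2-layer LSTM bottleneck feeding six decoders.

    A structural sketch only: layer sizes are assumptions, and simple
    concatenation replaces the paper's skip-nets.
    """

    def __init__(self, n_decoders: int = 6, depth: int = 4, base: int = 24):
        super().__init__()
        chans = [4] + [base * (i + 1) for i in range(depth)]  # 4-band input
        self.enc = nn.ModuleList(
            nn.Conv1d(chans[i], chans[i + 1], 15, padding=7)
            for i in range(depth))
        self.lstm = nn.LSTM(chans[-1], chans[-1], num_layers=2,
                            batch_first=True)
        self.dec = nn.ModuleList()
        self.head = nn.ModuleList()
        for _ in range(n_decoders):
            self.dec.append(nn.ModuleList(
                nn.Conv1d(chans[i + 1] * 2, chans[i], 5, padding=2)
                for i in reversed(range(depth))))
            self.head.append(nn.Conv1d(chans[0], 1, 1))

    def forward(self, x):            # x: (B, 4, T), T divisible by 2**depth
        skips = []
        for conv in self.enc:
            x = F.leaky_relu(conv(x), 0.2)
            skips.append(x)          # skip feature at full resolution
            x = x[..., ::2]          # decimate by 2
        x, _ = self.lstm(x.transpose(1, 2))
        x = x.transpose(1, 2)
        outs = []
        for ups, head in zip(self.dec, self.head):
            h = x
            for conv, skip in zip(ups, reversed(skips)):
                h = F.interpolate(h, size=skip.shape[-1], mode="linear")
                h = F.leaky_relu(conv(torch.cat([h, skip], dim=1)), 0.2)
            outs.append(head(h))     # one waveform per decoder
        return outs  # 4 band outputs, 1 pitch output, 1 classifier input

model = MultiDecoderWaveUNet()
outputs = model(torch.randn(2, 4, 16384))  # six (2, 1, 16384) tensors
```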
Funder
National Research Foundation of Korea
Cited by
1 article.