Author:
Shruti Kshirsagar, Anurag Pendyala, Tiago H. Falk
Abstract
Automatic emotion recognition (AER) systems are burgeoning, and systems based on audio, video, text, or physiological signals have emerged. Multimodal systems, in turn, have been shown to improve overall AER accuracy and to provide some robustness against artifacts and missing data. Collecting multiple signal modalities, however, can be very intrusive, time consuming, and expensive. Recent advances in deep-learning-based speech-to-text and natural language processing have enabled reliable multimodal systems based on speech and text that require only the collection of audio data. Audio, however, is extremely sensitive to environmental disturbances, such as additive noise, and thus faces challenges when deployed “in the wild.” To overcome this issue, speech enhancement algorithms have been applied at the input signal level to improve testing accuracy in noisy conditions. Speech enhancement algorithms come in different flavors and can be optimized for different tasks (e.g., human perception vs. machine performance). Data augmentation, in turn, has also been applied at the model level during training to improve accuracy in noisy testing conditions. In this paper, we explore the combination of task-specific speech enhancement and data augmentation as a strategy to improve overall multimodal emotion recognition in noisy conditions. We show that AER accuracy under noisy conditions can be improved to levels close to those seen in clean conditions. Compared with a system without speech enhancement or data augmentation, an increase in AER accuracy of 40% was observed in a cross-corpus test, showing promising results for “in the wild” AER.
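The data augmentation strategy mentioned in the abstract typically amounts to mixing clean training utterances with noise at controlled signal-to-noise ratios (SNRs). The sketch below is an illustrative example of that general technique only, not the authors' implementation; the function name `mix_at_snr` and the synthetic signals are hypothetical stand-ins.

```python
# Illustrative sketch (not the paper's code): additive-noise data augmentation
# at a chosen SNR, a common way to make emotion recognizers robust to noise.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the resulting speech-to-noise ratio is `snr_db` dB."""
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: augment a clean training utterance with noise at 5 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)    # stand-in for a 1 s, 16 kHz clean utterance
babble = rng.standard_normal(16000)   # stand-in for a recorded noise clip
noisy = mix_at_snr(clean, babble, snr_db=5.0)
```

In the setting the abstract describes, such noisy copies would be added to the training set, while a speech enhancement front-end would be applied to the (noisy) test audio before feature extraction and speech-to-text transcription.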
Subject
Computer Science Applications, Computer Vision and Pattern Recognition, Human-Computer Interaction, Computer Science (miscellaneous)
Cited by
3 articles.