Speech enhancement augmentation for robust speech recognition in noisy environments
Published: 2024
Volume: 59
Page: 04003
ISSN: 2271-2097
Container-title: ITM Web of Conferences
Short-container-title: ITM Web Conf.
Authors: Nasretdinov Rauf, Lependin Andrey, Ilyashenko Ilya
Abstract
Data augmentation has become an important tool for improving the performance of speech recognition systems. To make recognition work in noisy conditions, augmentation is usually used to simulate the presence of background noise. However, recognition quality on samples pre-processed by noise reduction models does not increase. This paper proposes a new approach to speech data augmentation for training ASR systems that are intended to be used jointly with speech enhancement models. The approach is based on creating several additional data samples containing speech processed by the enhancement model. It was tested on the E-Branchformer neural network model using data from the LibriSpeech corpus. The quality of speech samples was assessed using the DNSMOS metric. On a 100-hour subset of clean speech, the proposed augmentation improved the word error rate (WER) by more than 4% absolute compared to the generally accepted approach of adding noisy speech samples. Experiments on the 960-hour data showed that the approach remains robust as the training set size increases.
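The augmentation idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which variants are generated or at which noise levels, so the four variants, the SNR values, and the names `mix_at_snr`, `toy_enhance`, and `build_training_set` are all hypothetical. In particular, `toy_enhance` is a trivial smoothing filter standing in for a real speech enhancement model.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]
Utterance = Tuple[Waveform, str]          # (samples, transcript)
Tagged = Tuple[str, Waveform, str]        # (variant tag, samples, transcript)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> Waveform:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if not noise:
        raise ValueError("noise must not be empty")
    # Loop the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / max(len(speech), 1)
    noise_power = sum(n * n for n in tiled) / max(len(tiled), 1)
    if speech_power == 0.0 or noise_power == 0.0:
        return list(speech)
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, tiled)]


def toy_enhance(wav: Sequence[float]) -> Waveform:
    """Placeholder for an enhancement model (assumption): 3-tap moving average."""
    n = len(wav)
    return [(wav[max(i - 1, 0)] + wav[i] + wav[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def build_training_set(
    clean: Sequence[Utterance],
    noises: Sequence[Waveform],
    enhance: Callable[[Sequence[float]], Waveform],
    snrs_db: Sequence[float] = (0.0, 10.0, 20.0),   # assumed values
    seed: int = 0,
) -> List[Tagged]:
    """Return clean, noisy, and enhanced variants of every utterance."""
    rng = random.Random(seed)
    out: List[Tagged] = []
    for wav, text in clean:
        noisy = mix_at_snr(wav, rng.choice(list(noises)), rng.choice(list(snrs_db)))
        out.append(("clean", list(wav), text))
        out.append(("noisy", noisy, text))                    # conventional noise augmentation
        out.append(("enhanced_clean", enhance(wav), text))    # enhancement artifacts on clean speech
        out.append(("enhanced_noisy", enhance(noisy), text))  # what the ASR sees behind an enhancer
    return out
```

In this sketch every variant keeps the original transcript, so the ASR model is exposed to the distortions an enhancement front end introduces without any change to the labels.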