Authors:
Kumalija Elhard, Nakamoto Yukikazu
Abstract
In VoIP applications, such as Interactive Voice Response and VoIP-phone conversation transcription, speech signals are degraded not only by environmental noise but also by transmission network quality and by distortions induced by encoding and decoding algorithms. Therefore, automatic speech recognition (ASR) systems need to handle integrated noise-network distorted speech. In this study, we present a comparative analysis of a speech-to-text system trained on clean speech against one trained on integrated noise-network distorted speech. Training an ASR model on a noise-network distorted speech dataset improves its robustness. Although the performance of an ASR model trained on clean speech depends on noise type, this is not the case when the noisy speech is further distorted by network transmission. The model trained on noise-network distorted speech exhibited a 60% improvement rate in the word error rate (WER), match error rate (MER), and word information lost (WIL) over the model trained on clean speech. Furthermore, the ASR model trained on noise-network distorted speech could tolerate a jitter of less than 20% and a packet loss of less than 15% without a decrease in performance. However, WER, MER, and WIL increased in proportion to the jitter and packet loss once these exceeded 20% and 15%, respectively. Additionally, the model trained on noise-network distorted speech was more robust than the one trained on clean speech. The ASR model trained on noise-network distorted speech can also tolerate signal-to-noise ratio (SNR) values of 5 dB and above without loss of performance, independent of noise type.
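The three metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions based on a word-level alignment with H hits, S substitutions, D deletions, and I insertions: WER = (S + D + I) / (H + S + D), MER = (S + D + I) / (H + S + D + I), and WIL = 1 - H² / ((H + S + D)(H + S + I)). The function names `align_counts` and `asr_metrics` and the example sentences are hypothetical, and this is not necessarily how the authors computed their scores.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for a
    minimum-edit-distance word alignment. Ties between equally cheap
    alignments are broken in favour of more hits (an assumption)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)      # hit
            else:
                diag = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = dp[i - 1][j]
            deletion = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = dp[i][j - 1]
            insertion = (c + 1, h, s, d, n + 1)
            # Lowest cost first, then the most hits.
            dp[i][j] = min(diag, deletion, insertion,
                           key=lambda t: (t[0], -t[1]))
    return dp[-1][-1][1:]


def asr_metrics(reference, hypothesis):
    """Compute WER, MER, and WIL for one reference/hypothesis pair."""
    h, s, d, i = align_counts(reference, hypothesis)
    n_ref = h + s + d                       # words in the reference
    n_hyp = h + s + i                       # words in the hypothesis
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    errors = s + d + i
    wip = (h / n_ref) * (h / n_hyp) if n_hyp else 0.0  # word information preserved
    return {
        "wer": errors / n_ref,
        "mer": errors / (h + errors),
        "wil": 1.0 - wip,
    }


if __name__ == "__main__":
    ref = "please transfer my call to the billing department"
    hyp = "please transfer my call to billing apartment"
    print(asr_metrics(ref, hyp))
```

In this example the hypothesis drops one word and substitutes another, so WER and MER coincide; they diverge once insertions occur, because MER also counts insertions in its denominator and therefore never exceeds 1.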
Cited by
4 articles.