Performance evaluation of automatic speech recognition systems on integrated noise-network distorted speech

Published:2022-09-21 Volume:2
ISSN:2673-8198
Container-title:Frontiers in Signal Processing
Short-container-title:Front. Signal Process.

Author:

Kumalija Elhard, Nakamoto Yukikazu

Abstract

In VoIP applications, such as Interactive Voice Response and VoIP-phone conversation transcription, speech signals are degraded not only by environmental noise but also by transmission network quality and by distortions induced by encoding and decoding algorithms. Therefore, automatic speech recognition (ASR) systems need to handle integrated noise-network distorted speech. In this study, we present a comparative analysis of a speech-to-text system trained on clean speech against one trained on integrated noise-network distorted speech. Training an ASR model on a noise-network distorted speech dataset improves its robustness. Although the performance of an ASR model trained on clean speech depends on noise type, this is not the case when noise is further distorted by network transmission. The model trained on noise-network distorted speech exhibited a 60% improvement in the word error rate (WER), match error rate (MER), and word information lost (WIL) over the model trained on clean speech. Furthermore, the ASR model trained with noise-network distorted speech could tolerate a jitter of less than 20% and a packet loss of less than 15% without a decrease in performance. However, WER, MER, and WIL increased in proportion to the jitter and packet loss as they exceeded 20% and 15%, respectively. Additionally, the model trained on noise-network distorted speech exhibited higher robustness compared to that trained on clean speech. The ASR model trained on noise-network distorted speech can also tolerate signal-to-noise ratio (SNR) values of 5 dB and above without loss of performance, independent of noise type.
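
For readers unfamiliar with the three metrics reported in the abstract, the following is a minimal Python sketch of how WER, MER, and WIL are commonly computed from a word-level alignment. It assumes the usual definitions based on hit (H), substitution (S), deletion (D), and insertion (I) counts; the function names `align_counts` and `asr_metrics` are hypothetical, and this is not the authors' evaluation code.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of two word lists."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # hit or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to count each operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost == 0:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def asr_metrics(reference, hypothesis):
    """Compute WER, MER, and WIL for one reference/hypothesis pair.

    Assumed definitions:
      WER = (S + D + I) / (H + S + D)
      MER = (S + D + I) / (H + S + D + I)
      WIL = 1 - H^2 / (N_ref * N_hyp)
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    h, s, d, i = align_counts(ref, hyp)
    errors = s + d + i
    wer = errors / (h + s + d)
    mer = errors / (h + s + d + i)
    wil = 1.0 - (h * h) / (len(ref) * len(hyp)) if hyp else 1.0
    return {"wer": wer, "mer": mer, "wil": wil}


if __name__ == "__main__":
    print(asr_metrics("the cat sat on the mat", "the cat sat on mat"))
```

Under these definitions WER can exceed 1 when the hypothesis contains many insertions, whereas MER and WIL stay within 0 and 1, which is one reason the three are often reported together.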

Publisher

Frontiers Media SA

References: 33 articles.

1. Common Voice: A massively-multilingual speech corpus;Ardila,2020

2. The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines;Barker,2015

3. The PASCAL CHiME speech separation and recognition challenge;Barker;Comput. Speech Lang.,2013

4. The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines;Barker;Proc. Interspeech,2018

5. CTIMIT: A speech corpus for the cellular environment with applications to automatic speech recognition;Brown;ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc.,1995

Cited by 4 articles.

1. Ensemble Machine Learning Approach for Parkinson’s Disease Detection Using Speech Signals;Mathematics;2024-05-18

2. Evaluating OpenAI's Whisper ASR: Performance analysis across diverse accents and speaker traits;JASA Express Letters;2024-02-01

3. Contextual Learning for Missing Speech Automatic Speech Recognition;2024 International Conference on Electronics, Information, and Communication (ICEIC);2024-01-28

4. MiniatureVQNet: A Light-Weight Deep Neural Network for Non-Intrusive Evaluation of VoIP Speech Quality;Applied Sciences;2023-02-14