Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition
Authors:
Lee Geon Woo 1, Kim Hong Kook 1,2,3
Affiliations:
1. AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
2. School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
3. AunionAI Co., Ltd., Gwangju 61005, Republic of Korea
Abstract
This paper proposes a joint training approach for a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, in which an acoustic tokenizer is included to transfer linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides pseudo-labels through K-means clustering. To transfer the linguistic information represented by these pseudo-labels from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed; it is a self-supervised contrastive loss that is combined with an information noise-contrastive estimation (infoNCE) loss. The combined loss function prevents the SE model from overfitting to outlier samples and accounts for the pronunciation variability among samples sharing the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach with the CBPC loss function achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained with the proposed approach are higher than those of a standalone SE model and of SE models trained with conventional joint training approaches. An ablation study investigating the effects of different combinations of loss functions on the speech quality scores and WER shows that the CBPC loss combined with infoNCE reduces the WER and increases most of the speech quality scores.
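Since the abstract does not give the exact formulation of the combined objective, the following is a minimal PyTorch sketch of how a CBPC loss over K-means pseudo-labels could be combined with infoNCE. The function names, temperature value, equal weighting, and the supervised-contrastive-style treatment of same-cluster frames as positives are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    # InfoNCE: each enhanced frame's positive is its aligned clean/ASR frame;
    # every other frame in the batch acts as a negative.
    anchor = F.normalize(anchor, dim=-1)      # (N, D)
    positive = F.normalize(positive, dim=-1)  # (N, D)
    logits = anchor @ positive.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def cbpc_loss(embeddings, pseudo_labels, temperature=0.1):
    # Cluster-based pairwise contrastive loss (sketch): frames sharing a
    # K-means pseudo-label are positives, frames from other clusters are
    # negatives. Averaging the log-probability over all positive pairs,
    # instead of using a single positive, tolerates pronunciation
    # variability within a cluster.
    z = F.normalize(embeddings, dim=-1)  # (N, D)
    sim = z @ z.t() / temperature        # (N, N) pairwise similarities
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~eye
    sim = sim.masked_fill(eye, float("-inf"))  # exclude self-similarity
    log_prob = F.log_softmax(sim, dim=1)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)
    has_pos = pos_mask.any(dim=1)        # skip frames with no in-batch positive
    return per_anchor[has_pos].mean()

# Hypothetical usage: `enh` are SE-model frame embeddings, `clean` are the
# aligned ASR-encoder embeddings, and `labels` are K-means cluster indices
# (e.g., from sklearn's KMeans(n_clusters=K).fit_predict on ASR features).
N, D, K = 256, 512, 100
enh, clean = torch.randn(N, D), torch.randn(N, D)
labels = torch.randint(0, K, (N,))
loss = info_nce(enh, clean) + cbpc_loss(enh, labels)

In this sketch the two terms are simply summed; in practice a weighting hyperparameter between the infoNCE and CBPC terms would likely be tuned on a development set.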
Funder
Institute of Information & Communications Technology Planning & Evaluation (IITP) grant, funded by the Korean government