Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition

Authors:

Lee Geon Woo 1, Kim Hong Kook 1,2,3

Affiliations:

1. AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea

2. School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea

3. AunionAI Co., Ltd., Gwangju 61005, Republic of Korea

Abstract

This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to convey linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides pseudo-labels through K-means clustering. To transfer this linguistic information, represented by the pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed. This self-supervised contrastive loss function is combined with an information noise contrastive estimation (infoNCE) loss function. The combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability among samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the CBPC loss function achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained using the proposed approach are higher than those of the standalone SE model and of SE models trained using conventional joint training approaches. An ablation study is also conducted to investigate the effects of different combinations of loss functions on the speech quality scores and WER, revealing that the proposed CBPC loss function combined with the infoNCE loss contributes to a reduced WER and an increase in most of the speech quality scores.
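
As a rough illustration of the idea (not the authors' implementation), the sketch below shows an InfoNCE-form contrastive loss in which K-means pseudo-labels derived from the ASR encoder outputs define the positive pairs among SE-model embeddings: samples sharing a pseudo-label are pulled together, while all other samples in the batch act as negatives. The function name, temperature value, and tensor shapes are illustrative assumptions, and the exact CBPC formulation in the paper may differ.

```python
# Minimal sketch of a cluster-positive InfoNCE-style loss, assuming
# (N, D) SE-model embeddings and (N,) K-means pseudo-labels.
# Hypothetical helper name and temperature; not the paper's exact CBPC loss.
import torch
import torch.nn.functional as F


def cbpc_infonce_loss(embeddings: torch.Tensor,
                      pseudo_labels: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    # Compare embeddings in cosine-similarity space.
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N) pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))     # exclude self-pairs

    # Positives: other samples assigned to the same K-means cluster.
    pos_mask = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~self_mask

    # Row-wise log-softmax; average the log-probability of the positive pairs.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count

    # Only anchors that have at least one positive contribute to the loss.
    has_pos = pos_mask.any(dim=1)
    return loss[has_pos].mean() if has_pos.any() else sim.new_zeros(())


if __name__ == "__main__":
    feats = torch.randn(16, 256)            # e.g., frame- or utterance-level SE features
    clusters = torch.randint(0, 4, (16,))   # pseudo-labels from K-means on ASR encoder outputs
    print(cbpc_infonce_loss(feats, clusters))
```

In this form, the cluster pseudo-labels supply the supervision signal that the contrastive objective needs, which is how the sketch ties the K-means step to the SE-model training described in the abstract.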

Funder

Institute of Information & Communications Technology Planning & Evaluation (IITP) grant, funded by the Korean government

Publisher

MDPI AG

