Environment-Aware Knowledge Distillation for Improved Resource-Constrained Edge Speech Recognition

Authors:

Arthur Pimentel 1, Heitor R. Guimarães 1, Anderson Avila 1,2, Tiago H. Falk 1,2

Affiliation:

1. Institut National de la Recherche Scientifique (INRS-EMT), Université du Québec, Montreal, QC H5A 1K6, Canada

2. INRS-UQO Mixed Research Unit on Cybersecurity, Gatineau, QC J8X 3X7, Canada

Abstract

Recent advances in self-supervised learning have allowed automatic speech recognition (ASR) systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled data needed by their predecessors. However, while such models achieve SOTA results in matched train/test scenarios, their performance degrades substantially when tested in unseen conditions. To overcome this problem, strategies such as data augmentation and/or domain adaptation have been explored. Available models, however, are still too large to be considered for edge speech applications on resource-constrained devices; thus, model compression tools, such as knowledge distillation, are needed. In this paper, we propose three innovations on top of the existing DistilHuBERT distillation recipe: optimizing the prediction heads, applying a targeted data augmentation method for different environmental scenarios, and using a real-time environment estimator to choose between compressed models at inference time. Experiments with the LibriSpeech dataset, corrupted with varying noise types and reverberation levels, show the proposed method outperforming several benchmark methods, both original and compressed, with relative word error rate reductions of up to 48.4% and 89.2% in extremely noisy and reverberant conditions, respectively, while reducing the number of parameters by 50%. Thus, the proposed method is well suited for resource-constrained edge speech recognition applications.
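To make the inference-time routing idea in the abstract concrete, the sketch below shows one way a real-time environment estimator could select among several compressed models. This is a minimal illustration only; the SNR proxy, the thresholds, and the model names (`clean`, `mid`, `noisy`) are assumptions for exposition and do not reflect the authors' actual estimator or distilled models.

```python
# Hypothetical sketch of environment-aware model selection: estimate the
# acoustic condition of incoming audio and route it to the compressed ASR
# model trained for that condition. Thresholds and feature choice are
# illustrative assumptions, not the paper's implementation.
import numpy as np


def estimate_snr_db(signal: np.ndarray, frame_len: int = 400) -> float:
    """Crude SNR proxy: ratio of high-energy to low-energy frame power, in dB."""
    usable = len(signal) - len(signal) % frame_len
    frames = signal[:usable].reshape(-1, frame_len)
    power = (frames ** 2).mean(axis=1) + 1e-12
    speech_p = np.percentile(power, 90)  # assume loud frames contain speech
    noise_p = np.percentile(power, 10)   # assume quiet frames are noise floor
    return 10.0 * np.log10(speech_p / noise_p)


def select_model(signal: np.ndarray, models: dict):
    """Pick a compressed ASR model based on a rough environment estimate."""
    snr = estimate_snr_db(signal)
    if snr < 5.0:            # severely degraded input
        return models["noisy"]
    elif snr < 15.0:         # moderately degraded input
        return models["mid"]
    return models["clean"]   # near-clean input


# Usage sketch (the model objects are placeholders for distilled ASR models):
# models = {"clean": distil_clean, "mid": distil_mid, "noisy": distil_noisy}
# asr = select_model(audio, models)
# transcript = asr.transcribe(audio)
```

In practice, the estimator would run on short buffered segments so that the routing decision adds negligible latency on the edge device.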

Funder

Natural Sciences and Engineering Research Council of Canada

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

