Environment-Aware Knowledge Distillation for Improved Resource-Constrained Edge Speech Recognition
Published: 2023-11-22
Container title: Applied Sciences
Volume: 13, Issue: 23, Page: 12571
ISSN: 2076-3417
Language: en
Authors:
Arthur Pimentel¹, Heitor R. Guimarães¹, Anderson Avila¹,², Tiago H. Falk¹,²
Affiliations:
1. Institut National de la Recherche Scientifique (INRS-EMT), Université du Québec, Montreal, QC H5A 1K6, Canada
2. INRS-UQO Mixed Research Unit on Cybersecurity, Gatineau, QC J8X 3X7, Canada
Abstract
Recent advances in self-supervised learning have allowed automatic speech recognition (ASR) systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled data needed by their predecessors. However, while such models achieve SOTA results in matched train/test scenarios, their performance degrades substantially when tested in unseen conditions. To overcome this problem, strategies such as data augmentation and/or domain adaptation have been explored. Available models, however, are still too large to be considered for edge speech applications on resource-constrained devices; thus, model compression tools, such as knowledge distillation, are needed. In this paper, we propose three innovations on top of the existing DistilHuBERT distillation recipe: optimizing the prediction heads, employing a targeted data augmentation method for different environmental scenarios, and using a real-time environment estimator to choose between compressed models at inference time. Experiments with the LibriSpeech dataset, corrupted with varying noise types and reverberation levels, show the proposed method outperforming several benchmark methods, both original and compressed, by as much as 48.4% and 89.2% in word error reduction rate in extremely noisy and reverberant conditions, respectively, while reducing the number of parameters by 50%. The proposed method is thus well suited for resource-constrained edge speech recognition applications.
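The third innovation described above, a real-time environment estimator that routes each utterance to a condition-specific compressed model, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the thresholds, the `MODELS` mapping, and all function names are hypothetical placeholders assumed for the example.

```python
# Hypothetical sketch: a lightweight environment estimator routes each
# utterance to a compressed ASR model specialized for the detected
# acoustic condition. Thresholds and model names are illustrative only.

def estimate_environment(snr_db: float, rt60_s: float) -> str:
    """Classify the acoustic condition from two simple estimates:
    signal-to-noise ratio (dB) and reverberation time RT60 (s)."""
    if rt60_s > 0.5:      # strong reverberation dominates
        return "reverberant"
    if snr_db < 10.0:     # low SNR indicates additive noise
        return "noisy"
    return "clean"

# One distilled model per condition (placeholder identifiers for
# compressed models produced by condition-targeted augmentation).
MODELS = {
    "clean": "distilhubert-clean",
    "noisy": "distilhubert-noise-augmented",
    "reverberant": "distilhubert-reverb-augmented",
}

def select_model(snr_db: float, rt60_s: float) -> str:
    """Return the identifier of the compressed model to use for inference."""
    return MODELS[estimate_environment(snr_db, rt60_s)]
```

For example, `select_model(5.0, 0.2)` routes a low-SNR, low-reverberation utterance to the noise-augmented model. In a deployed system, SNR and RT60 would themselves be estimated blindly from the incoming audio.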
Funder
Natural Sciences and Engineering Research Council of Canada
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science