LWMD: A Comprehensive Compression Platform for End-to-End Automatic Speech Recognition Models
Published: 2023-01-26
Volume: 13
Issue: 3
Page: 1587
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Author:
Liu Yukun 1,2, Li Ta 1,2, Zhang Pengyuan 1,2, Yan Yonghong 1,2,3
Affiliation:
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, No. 21 North 4th Ring Road, Haidian District, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 101408, China
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, 40-1 South Beijing Road, Urumqi 830011, China
Abstract
Recently, end-to-end (E2E) automatic speech recognition (ASR) models have achieved promising performance. However, existing models tend to adopt ever larger model sizes and suffer from expensive resource consumption in real-world applications. To compress E2E ASR models and obtain smaller model sizes, we propose a comprehensive compression platform named LWMD (light-weight model designing), which consists of two essential parts: a light-weight architecture search (LWAS) framework and a differentiable structured pruning (DSP) algorithm. On the one hand, the LWAS framework adopts the neural architecture search (NAS) technique to automatically search light-weight architectures for E2E ASR models. By integrating the architecture topologies of different existing models, LWAS designs a topology-fused search space. Furthermore, combined with the E2E ASR training criterion, LWAS develops a resource-aware search algorithm to select light-weight architectures from the search space. On the other hand, given the searched architectures, the DSP algorithm performs structured pruning to further reduce the number of parameters. With a Gumbel re-parameterization trick, DSP builds a stronger correlation between the pruning criterion and the model performance than conventional pruning methods. An attention-similarity loss function is further developed for better performance. On two Mandarin datasets, Aishell-1 and HKUST, the compression results are evaluated and analyzed to demonstrate the effectiveness of the LWMD platform.
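A minimal sketch (PyTorch) of the two ideas named in the abstract, assuming head-level pruning gates: a straight-through Gumbel-softmax gate that keeps the structured keep/drop decision differentiable during training, and an attention-similarity loss that penalizes divergence between the pruned and unpruned attention maps. The class name, function names, tensor shapes, and MSE form are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch (not the paper's code): a Gumbel-softmax keep/drop gate
# for structured pruning plus an attention-similarity loss, in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelHeadGate(nn.Module):
    """Learnable keep/drop decision per attention head via Gumbel-softmax.

    Each head owns a 2-way logit (keep, drop). Straight-through Gumbel-softmax
    sampling keeps the pruning decision differentiable, so the gate can be
    trained jointly with the task loss instead of using a post-hoc criterion.
    """
    def __init__(self, num_heads: int, tau: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads, 2))  # [keep, drop] logits per head
        self.tau = tau

    def forward(self) -> torch.Tensor:
        # Hard one-hot samples in the forward pass, soft gradients in the backward pass.
        samples = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        return samples[:, 0]  # keep-mask of shape (num_heads,)


def attention_similarity_loss(pruned_attn: torch.Tensor,
                              full_attn: torch.Tensor) -> torch.Tensor:
    """MSE between the pruned model's attention maps and the unpruned model's
    (one plausible reading of an 'attention-similarity' objective)."""
    return F.mse_loss(pruned_attn, full_attn)


# Toy usage: mask multi-head attention maps with the sampled gate.
if __name__ == "__main__":
    num_heads, batch, seq = 8, 4, 32
    gate = GumbelHeadGate(num_heads)
    attn = torch.rand(batch, num_heads, seq, seq).softmax(dim=-1)  # stand-in attention maps
    mask = gate().view(1, num_heads, 1, 1)
    pruned_attn = attn * mask                                      # dropped heads contribute zero
    loss = attention_similarity_loss(pruned_attn, attn)
    loss.backward()                                                # gradients flow into gate.logits
    print(mask.squeeze(), loss.item())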
Funder
National Key Research and Development Program of China; Goal-Oriented Project Independently Deployed by Institute of Acoustics, Chinese Academy of Sciences
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science