Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters
Published: 2024-01-01
Container-title: Electronics
Volume: 13
Issue: 1
Page: 190
ISSN: 2079-9292
Language: en
Author:
Yue, Xianghu 1,2 (ORCID); Gao, Xiaoxue 1; Qian, Xinyuan 3 (ORCID); Li, Haizhou 1,2,4
Affiliation:
1. Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583, Singapore
2. School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
3. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
4. Shenzhen Research Institute of Big Data, Shenzhen 518172, China
Abstract
Self-supervised learning (SSL) is an effective way to learn rich, transferable speech representations from unlabeled data that benefit downstream tasks. However, effectively incorporating a pre-trained SSL model into an automatic speech recognition (ASR) system remains challenging. In this paper, we propose a network architecture with light-weight adapters to adapt a pre-trained SSL model for end-to-end (E2E) ASR. An adapter is introduced in each SSL network layer and trained on the downstream ASR task, while the parameters of the pre-trained SSL layers remain unchanged. By carrying over all pre-trained parameters, we avoid the catastrophic forgetting problem; at the same time, the light-weight adapters allow the network to adapt quickly to the ASR task. Experiments on the LibriSpeech and Wall Street Journal (WSJ) datasets show that (1) the proposed adapter-based fine-tuning consistently outperforms full fine-tuning in low-resource scenarios, with up to 17.5%/12.2% relative word error rate (WER) reduction on the 10 min LibriSpeech split; and (2) adapter-based adaptation also achieves competitive performance in high-resource scenarios, further validating the effectiveness of the adapters.
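As a rough illustration of the mechanism described in the abstract, the following is a minimal PyTorch sketch assuming a standard bottleneck adapter design (layer norm, down-projection, nonlinearity, up-projection, residual connection). The module names, helper function, and bottleneck size are hypothetical and illustrative; they are not the authors' exact implementation or configuration.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Light-weight adapter: LayerNorm -> down-projection -> ReLU ->
    up-projection, added back through a residual connection.
    (Illustrative sketch, not the paper's exact design.)"""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual path preserves the frozen SSL representation;
        # only the small projections learn the ASR-specific adjustment.
        return x + self.up(self.act(self.down(self.norm(x))))


class AdaptedLayer(nn.Module):
    """Wraps one pre-trained SSL network layer with a trainable adapter."""

    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


def adapt_ssl_encoder(layers: nn.ModuleList, dim: int) -> nn.ModuleList:
    """Freeze every pre-trained parameter (avoiding catastrophic
    forgetting) and insert one trainable adapter per SSL layer."""
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    return nn.ModuleList(AdaptedLayer(layer, dim) for layer in layers)
```

Under this sketch, only the adapter parameters (together with whatever E2E ASR output head sits on top) receive gradients during downstream training, which is what keeps the adaptation light-weight relative to full fine-tuning.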
Funder:
- CCF-Tencent Rhino-Bird Open Research Fund
- National Natural Science Foundation of China
- Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen