Abstract
Limited data availability remains a significant challenge for Whisper's low-resource speech recognition performance, which falls short of practical application requirements. While previous studies have successfully reduced recognition error rates on target-language speech through fine-tuning, a comprehensive exploration and analysis of Whisper's fine-tuning capabilities, and of the advantages and disadvantages of the various fine-tuning strategies, is still lacking. This paper aims to fill this gap by conducting a comprehensive experimental exploration of Whisper's low-resource speech recognition performance using five fine-tuning strategies with limited supervised data from seven low-resource languages. The results and analysis demonstrate that all fine-tuning strategies explored in this paper significantly enhance Whisper's performance. However, the strategies vary in their suitability and practical effectiveness, highlighting the need for careful selection based on the specific use case and the resources available.
Funder
the National Natural Science Foundation of China
Natural Science Foundation of Henan Province of China
Henan Zhongyuan Science and Technology Innovation Leading Talent Project
Publisher
Springer Science and Business Media LLC