Three-stage training and orthogonality regularization for spoken language recognition

Published: 2023-04-06 Issue: 1 Volume: 2023
ISSN: 1687-4722
Container-title: EURASIP Journal on Audio, Speech, and Music Processing
Language: en
Short-container-title: J AUDIO SPEECH MUSIC PROC.

Author:

Li Zimu, Xu Yanyan, Ke Dengfeng, Su Kaile

Abstract

Spoken language recognition has made significant progress in recent years, with automatic speech recognition often used as a parallel branch to extract phonetic features. However, a better training strategy for such two-branch architectures is still lacking. In this paper, we analyze the most commonly used two-stage training strategies and reveal a trade-off between recognition accuracy and generalization ability. Based on this analysis, we propose a three-stage training strategy and an orthogonality regularization method. The former adds a multi-task learning stage to the traditional two-stage training strategy to extract hybrid-level and noiseless features, which improves recognition accuracy while maintaining generalization ability; the latter constrains the orthogonality of base vectors and introduces prior knowledge to improve recognition accuracy. Experiments on the Oriental Language Recognition (OLR) dataset indicate that both proposed methods improve language recognition accuracy and generalization ability, especially in challenging tasks such as cross-channel or noisy conditions. Moreover, our model, which combines the two proposed methods, outperforms the top three teams in the OLR20 challenge.
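
The abstract does not state how the orthogonality constraint on the base vectors is formulated. As a rough illustration only, the sketch below uses a common soft form of such a constraint: the squared Frobenius distance between the Gram matrix of the L2-normalized vectors and the identity matrix. The function names and the example vectors are hypothetical and are not taken from the paper.

```python
# Hypothetical illustration of a soft orthogonality penalty on base vectors.
# Assumption: the penalty is ||G - I||_F^2, where G is the Gram matrix of the
# L2-normalized vectors. The paper's actual formulation may differ.
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def orthogonality_penalty(base_vectors):
    """Sum of squared differences between the Gram matrix and the identity."""
    unit = [normalize(v) for v in base_vectors]
    penalty = 0.0
    for i, u in enumerate(unit):
        for j, w in enumerate(unit):
            dot = sum(a * b for a, b in zip(u, w))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


# Mutually orthogonal vectors give a penalty of (approximately) zero;
# correlated vectors give a strictly positive penalty.
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.0], [1.0, 1.0]]
print(orthogonality_penalty(orthogonal))
print(orthogonality_penalty(correlated))
```

In training, a term of this kind would typically be added to the classification loss with a small weighting coefficient, so that the optimizer is encouraged, rather than forced, to keep the base vectors close to orthogonal.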

Funder

Fundamental Research Funds for the Central Universities

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering, Acoustics and Ultrasonics

Link

https://link.springer.com/content/pdf/10.1186/s13636-023-00281-y.pdf

Reference: 52 articles.

1. E. Ambikairajah, H. Li, L. Wang, B. Yin, V. Sethu, Language identification: a tutorial. IEEE Circ. Syst. Mag. 11(2), 82–108 (2011)

2. A. Waibel, P. Geutner, L.M. Tomokiyo, T. Schultz, M. Woszczyna, Multilinguality in speech and spoken language systems. Proc. IEEE. 88(8), 1297–1313 (2000). https://doi.org/10.1109/5.880085

3. S. Punjabi, H. Arsikere, Z. Raeesy, C. Chandak, N. Bhave, A. Bansal, M. Müller, S. Murillo, A. Rastrow, S. Garimella, R. Maas, M. Hans, A. Mouchtaris, S. Kunzmann, Streaming end-to-end bilingual ASR systems with joint language identification. arXiv preprint arXiv:2007.03900 (2020)

4. D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, S. Khudanpur, Spoken language recognition using x-vectors, in Odyssey 2018: The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables d’Olonne, France, ed. by A. Larcher, J. Bonastre (ISCA, 2018), pp. 105–111. https://doi.org/10.21437/Odyssey.2018-15

5. H. Li, B. Ma, K. Lee, Spoken language recognition: From fundamentals to practice. Proc. IEEE. 101(5), 1136–1159 (2013). https://doi.org/10.1109/JPROC.2012.2237151

Cited by 1 article.

1. PLDE: A lightweight pooling layer for spoken language recognition; Speech Communication; 2024-03