Speech Recognition of Accented Mandarin Based on Improved Conformer
Authors:
Yang Xing-Yao¹, Zhang Shao-Dong¹, Xiao Rui¹, Yu Jiong¹, Li Zi-Yang¹
Affiliation:
1. School of Software, Xinjiang University, 666, Shengli Road, Urumqi 830049, China
Abstract
The convolution module in Conformer provides translation-invariant convolution in time and space. It is often used in Mandarin recognition tasks to handle the diversity of speech signals by treating their time-frequency maps as images. However, convolutional networks are most effective at modeling local features, whereas dialect recognition requires extracting contextual information over long sequences; this paper therefore proposes the SE-Conformer-TCN. Embedding the squeeze-excitation block into the Conformer explicitly models the interdependence between channel features, which strengthens the model’s ability to select interrelated channels: the weights of effective speech spectrogram features increase, and the weights of ineffective or less effective feature maps decrease. The multi-head self-attention and the temporal convolutional network are built in parallel; within the latter, the dilated causal convolution module covers the input time series by increasing the dilation factor and the convolution kernel size, capturing the position information implicit in the sequence and improving the model’s access to it. Experiments on four public datasets demonstrate that the proposed model achieves higher performance on accented Mandarin, reducing the sentence error rate by 2.1% compared with the Conformer while reaching a character error rate of only 4.9%.
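The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy weights `w1` and `w2`, and the single-convolution-per-layer assumption in `receptive_field` are all hypothetical and chosen only to show the idea. The first function reweights channels with a squeeze-excitation gate; the second applies one dilated causal convolution; the third shows how stacking dilations widens the span of past frames each output can see.

```python
import math


def squeeze_excite(feature_maps, w1, w2):
    """Reweight channels with a squeeze-excitation gate (illustrative only).

    feature_maps: list of C channels, each a list of T values.
    w1: hypothetical bottleneck weights, shape (C_reduced, C).
    w2: hypothetical expansion weights, shape (C, C_reduced).
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average pooling over time for every channel.
    z = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
             for row in w2]
    # Scale: each channel is multiplied by its gate in (0, 1).
    scaled = [[g * x for x in ch] for g, ch in zip(gates, feature_maps)]
    return scaled, gates


def dilated_causal_conv(x, kernel, dilation):
    """Compute y[t] = sum_j kernel[j] * x[t - j * dilation].

    Samples before the start of the sequence are treated as zero, so
    y[t] never depends on inputs later than t (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions.

    Assumes one convolution per layer; real TCN residual blocks often
    use two, which would enlarge the result.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    maps = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
    _, gates = squeeze_excite(maps, w1=[[1.0, 0.0]], w2=[[1.0], [-1.0]])
    print("channel gates:", gates)
    print("receptive field:", receptive_field(3, [1, 2, 4, 8]))
```

Under the one-convolution-per-layer assumption, the receptive field grows linearly with both the kernel size and the sum of the dilation factors, which is why increasing either lets the temporal branch cover longer stretches of the input sequence.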
Funder
National Natural Science Foundation of China; Natural Science Foundation of Xinjiang Uygur Autonomous Region of China; Education Department Project of Xinjiang Uygur Autonomous Region; Doctoral Research Start-up Foundation of Xinjiang University
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry