Abstract
With the rapid advancement of deep learning, Transformer-based attention networks have shown promising performance in keyword spotting (KWS). However, this approach incurs high computational cost, owing to the excessive parameter count of the Transformer model and the burden of global attention, which limits its applicability in resource-constrained KWS scenarios. To overcome this issue, we propose a novel Swin-Transformer-based KWS method. In this approach, dynamic features are first extracted from input Mel-Frequency Cepstral Coefficients (MFCCs) using a Temporal Convolutional Network (TCN). The Swin-Transformer is then employed to capture hierarchical multi-scale features, with a window attention designed to grasp dynamic time–frequency features. Furthermore, a frame-level shifted window attention mechanism is proposed to strengthen inter-window interaction, thereby extracting more contextual information from the spectrogram. Experimental results on the Speech Commands V1 dataset verify the effectiveness of the proposal, which achieves a recognition accuracy of 98.01% with fewer model parameters, outperforming existing KWS methods.
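The core idea behind the shifted-window mechanism described above can be illustrated with a minimal sketch: frames are grouped into fixed-size attention windows, and alternating layers cyclically shift the sequence so that neighboring windows exchange context. The function names, window size, and shift amount below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of Swin-style window partitioning along the time (frame) axis.
# All names and sizes here are assumptions for illustration only.

def partition_windows(frames, window_size):
    """Split a sequence of frames into non-overlapping attention windows."""
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]

def shift_frames(frames, shift):
    """Cyclically roll the frame sequence so the next attention layer sees
    shifted windows, letting adjacent windows interact across boundaries."""
    return frames[-shift:] + frames[:-shift]

frames = list(range(8))   # 8 time frames, e.g. TCN features over MFCCs
w = 4                     # window size (assumed)

regular = partition_windows(frames, w)
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]
shifted = partition_windows(shift_frames(frames, w // 2), w)
# -> [[6, 7, 0, 1], [2, 3, 4, 5]]: each shifted window mixes frames
# from two previously separate windows, propagating context between them.
```

In the shifted layout, frames 0–1 now attend together with frames 6–7, so information flows between windows without computing global attention over all frames.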
Funder
National Natural Science Foundation of China
State Key Laboratory of Food Science and Technology, Nanchang University
Natural Science Foundation of Jiangxi Province
Natural Science Foundation of Shandong Province
Publisher
Springer Science and Business Media LLC