Abstract
With the rapid advancement of deep learning, Transformer-based attention networks have shown promising performance in keyword spotting (KWS). However, this approach incurs high computational cost, owing to the excessive parameter count of the Transformer model and the burden of global attention, which limits its applicability in resource-constrained KWS scenarios. To overcome this issue, we propose a novel Swin-Transformer-based KWS method. In this approach, dynamic features are first extracted from input Mel-Frequency Cepstral Coefficients (MFCCs) using a Temporal Convolutional Network (TCN). The Swin-Transformer is then employed to capture hierarchical multi-scale features, with a window attention designed to grasp dynamic time–frequency features. Furthermore, a frame-level shifted window attention mechanism is proposed to strengthen inter-window interaction, thereby extracting more contextual information from the spectrogram. Experimental results on the Speech Commands V1 dataset verify the effectiveness of the proposal, which achieves a recognition accuracy of 98.01% with fewer model parameters, outperforming existing KWS methods.
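The core idea behind the shifted-window mechanism described above can be illustrated with a minimal sketch: frames are grouped into fixed-size attention windows, and alternating layers cyclically shift the sequence so that neighboring windows exchange context. The function names, window size, and shift amount below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of Swin-style window partitioning along the time (frame) axis.
# All names and sizes here are assumptions for illustration only.

def partition_windows(frames, window_size):
    """Split a sequence of frames into non-overlapping attention windows."""
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]

def shift_frames(frames, shift):
    """Cyclically roll the frame sequence so the next attention layer sees
    shifted windows, letting adjacent windows interact across boundaries."""
    return frames[-shift:] + frames[:-shift]

frames = list(range(8))   # 8 time frames, e.g. TCN features over MFCCs
w = 4                     # window size (assumed)

regular = partition_windows(frames, w)
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]
shifted = partition_windows(shift_frames(frames, w // 2), w)
# -> [[6, 7, 0, 1], [2, 3, 4, 5]]: each shifted window mixes frames
# from two previously separate windows, propagating context between them.
```

In the shifted layout, frames 0–1 now attend together with frames 6–7, so information flows between windows without computing global attention over all frames.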
Funder
National Natural Science Foundation of China
State Key Laboratory of Food Science and Technology, Nanchang University
Natural Science Foundation of Jiangxi Province
Natural Science Foundation of Shandong Province
Publisher
Springer Science and Business Media LLC