Low-Power Feature-Attention Chinese Keyword Spotting Framework with Distillation Learning

Authors:

Lei Lei¹, Yuan Guoshun², Zhang Tianle³, Yu Hongjiang¹

Affiliation:

1. Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China

2. Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China

3. Institute of Automation, Chinese Academy of Sciences, China and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Abstract

In this paper, we propose a novel Low-Power Feature-Attention Chinese Keyword Spotting Framework based on a depthwise separable convolutional neural network (DSCNN) with distillation learning to recognize speech signals of Chinese wake-up words. The framework consists of a low-power feature-attention acoustic model and its learning methods. Unlike existing models, the proposed acoustic model based on connectionist temporal classification (CTC) focuses on reducing power consumption by cutting network parameters and multiply-accumulate (MAC) operations through our designed feature-attention network and DSCNN. In particular, the feature-attention network is specially designed to extract effective syllable features from a large number of MFCC features. It refines the MFCC features by selectively focusing on informative speech features and discarding uninformative ones, which significantly reduces the parameters and MAC operations of the whole acoustic model. Moreover, DSCNN, which requires fewer parameters and MAC operations than a traditional convolutional neural network, is adopted to extract effective high-dimensional features from the syllable features. Furthermore, we apply a distillation learning algorithm to efficiently train the proposed low-power acoustic model by utilizing the knowledge of a trained large acoustic model. Experimental results verify the effectiveness of our model and show that the proposed acoustic model achieves better accuracy than other acoustic models while exhibiting the lowest power consumption and latency, as measured on an NVIDIA Jetson TX2. It has only 14.524 KB of parameters, consumes only 0.141 J of energy per query, and incurs 17.9 ms of latency on the platform, making it hardware-friendly.
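A minimal sketch of the components the abstract describes, assuming PyTorch. The module names (FeatureAttention, DSConvBlock), layer sizes, number of retained features, and the distillation temperature are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hedged sketch in PyTorch; names, sizes, and hyperparameters are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAttention(nn.Module):
    """Scores each MFCC coefficient and keeps only the most informative ones,
    shrinking the feature dimension before the convolutional front end."""

    def __init__(self, n_mfcc: int = 40, n_keep: int = 20):
        super().__init__()
        self.score = nn.Linear(n_mfcc, n_mfcc)  # per-coefficient attention logits
        self.n_keep = n_keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mfcc)
        weights = torch.sigmoid(self.score(x.mean(dim=1)))   # (batch, n_mfcc)
        x = x * weights.unsqueeze(1)                          # re-weight features
        topk = weights.topk(self.n_keep, dim=-1).indices      # drop weak features
        return torch.gather(x, 2, topk.unsqueeze(1).expand(-1, x.size(1), -1))


class DSConvBlock(nn.Module):
    """Depthwise separable 1-D convolution: a depthwise conv followed by a
    pointwise conv, using far fewer parameters/MACs than a standard conv."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))


def distillation_loss(student_logits, teacher_logits, ctc_loss,
                      alpha: float = 0.5, T: float = 2.0):
    """Combine the hard CTC loss with a soft KL term against the teacher."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * ctc_loss + (1.0 - alpha) * soft
```

The attention gate illustrates the abstract's feature-reduction idea (fewer input features mean fewer downstream parameters and MACs), and the depthwise/pointwise split is the standard way a DSCNN trades a full convolution for two cheaper ones.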

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

