Affiliation:
1. Key Laboratory of Machine Perception, Peking University Shenzhen Graduate School, Shenzhen, China
2. College of Electronics and Information Engineering, Sichuan University, Chengdu, China
3. Department of Computer Science, ETH Zurich, Zurich, Switzerland
Abstract
As one of the most effective methods to improve the accuracy and robustness of speech tasks, the audio–visual fusion approach has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio–visual keyword spotting models are limited to detecting isolated words, while keyword spotting in unconstrained speech remains a challenging problem. To this end, an Audio–Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from variable-length audio and visual inputs. The outputs of the audio and visual branches are combined in a decision fusion module. Just as humans can easily notice whether a keyword appears in a sentence, the AVKT network detects whether a video clip of a spoken sentence contains a pre-specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2-KWS dataset and the newly collected PKU-KWS dataset show that the accuracy of AVKT exceeds 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
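To make the described architecture concrete, below is a minimal PyTorch sketch of a CLS-token transformer classifier with decision-level fusion. This is not the authors' released implementation (see the GitHub link above for that); all class names, dimensions, and the simple probability-averaging fusion here are illustrative assumptions.

```python
# Hypothetical sketch of a CLS-token transformer classifier with decision fusion.
# Names, dimensions, and the averaging fusion rule are assumptions, not the AVKT code.
import torch
import torch.nn as nn


class CLSTransformerBranch(nn.Module):
    """One modality branch: prepends a learnable CLS token to a
    variable-length feature sequence and classifies from its output."""

    def __init__(self, dim=256, heads=4, layers=2, num_keywords=10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(dim, num_keywords)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(h[:, 0])               # logits read off the CLS position


class DecisionFusionKWS(nn.Module):
    """Audio and visual branches scored independently, then fused at the
    decision level by averaging per-keyword probabilities."""

    def __init__(self, dim=256, num_keywords=10):
        super().__init__()
        self.audio = CLSTransformerBranch(dim, num_keywords=num_keywords)
        self.visual = CLSTransformerBranch(dim, num_keywords=num_keywords)

    def forward(self, audio_feats, visual_feats):
        pa = self.audio(audio_feats).softmax(dim=-1)
        pv = self.visual(visual_feats).softmax(dim=-1)
        return 0.5 * (pa + pv)                  # fused keyword probabilities


# Usage: sequences of different lengths per modality are handled naturally,
# since only the CLS position is used for classification.
model = DecisionFusionKWS()
audio = torch.randn(2, 120, 256)   # e.g. 120 audio frames
video = torch.randn(2, 75, 256)    # e.g. 75 video frames
probs = model(audio, video)        # (2, 10) fused keyword probabilities
```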
Funder
National Natural Science Foundation of China
Publisher
Institution of Engineering and Technology (IET)
Subject
Artificial Intelligence, Computer Networks and Communications, Computer Vision and Pattern Recognition, Human-Computer Interaction, Information Systems
Cited by
1 article.