Voice Keyword Retrieval Method Using Attention Mechanism and Multimodal Information Fusion

Published: 2021-01-23
Volume: 2021
Pages: 1-11
ISSN: 1875-919X
Container-title: Scientific Programming
Language: en

Author:

Zhang Hongli¹

Affiliation:

1. Department of Educational Technology, Inner Mongolia Normal University, Inner Mongolia, Hohhot 010022, China

Abstract

This paper proposes a cross-modal speech-text retrieval method based on an interactive-learning convolutional autoencoder (CAE). First, an interactive-learning autoencoder structure is designed that takes speech and text as its two inputs and models cross-modal speech-text retrieval through encoding, hidden-layer interaction, and decoding stages. Next, the raw audio signal is preprocessed and Mel-frequency cepstral coefficient (MFCC) features are extracted, while a bag-of-words model extracts the text features; an attention mechanism then combines the text and speech features. The interactive-learning CAE learns features shared by the speech and text modalities, which are passed to a modality classifier that identifies the modality information, thereby realizing cross-modal speech-text retrieval. Finally, experiments show that the proposed algorithm outperforms the comparison algorithms in terms of recall rate, accuracy rate, and false recognition rate.
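The abstract does not give implementation details, so the following is only a minimal sketch of two of the steps it names: bag-of-words text features and attention-based fusion with frame-level speech features. Everything specific here is an assumption made for illustration: the function names, the scaled dot-product form of the attention, the random projection matrices `w_text` and `w_speech`, the toy vocabulary, and the random array standing in for real MFCC frames. It is not the author's model.

```python
import numpy as np


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary))
    for token in tokens:
        if token in index:
            counts[index[token]] += 1.0
    return counts


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def attention_fuse(text_vec, speech_frames, w_text, w_speech):
    """Fuse a text vector with per-frame speech features.

    Assumed form: the projected text vector acts as the query and the
    projected speech frames act as keys and values (scaled dot-product).
    """
    query = text_vec @ w_text                          # shape (d,)
    keys = speech_frames @ w_speech                    # shape (T, d)
    scores = keys @ query / np.sqrt(query.shape[0])    # shape (T,)
    weights = softmax(scores)                          # one weight per frame
    speech_summary = weights @ keys                    # shape (d,)
    fused = np.concatenate([query, speech_summary])    # shape (2 * d,)
    return fused, weights


rng = np.random.default_rng(0)
vocabulary = ["open", "the", "door", "light"]          # hypothetical vocabulary
text_vec = bag_of_words("open the door the".split(), vocabulary)

# Stand-in for 50 frames of 13 MFCCs; real features would come from audio.
speech_frames = rng.normal(size=(50, 13))

# Random projections into a shared 8-dimensional space (untrained, illustrative).
w_text = rng.normal(size=(len(vocabulary), 8))
w_speech = rng.normal(size=(13, 8))

fused, weights = attention_fuse(text_vec, speech_frames, w_text, w_speech)
```

In a trained system the projections would be learned jointly with the autoencoder, and the fused vector would feed the encoder rather than being used directly.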

Funder

Key Technology Project of Inner Mongolia Autonomous Region

Publisher

Hindawi Limited

Subject

Computer Science Applications, Software

Link

http://downloads.hindawi.com/journals/sp/2021/6662841.pdf
