Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Published: 2020-10-17 Issue: 20 Volume: 10 Page: 7263
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Lee Yong-Hyeok, Jang Dong-Won, Kim Jae-Bin, Park Rae-Hong, Park Hyung-Min

Abstract

Since the attention mechanism was introduced in neural machine translation, attention has either been combined with long short-term memory (LSTM) or has replaced the LSTM in the transformer model to overcome the sequence-to-sequence (seq2seq) problems of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may achieve improved performance by learning the correlation between the audio and visual modalities. Because the audio carries richer information than the video of the lips, it is hard in AVSR to train attentions that balance the two modalities. To raise the role of the visual modality to the level of the audio modality by fully exploiting the input information when learning attentions, we propose a dual cross-modality (DCM) attention scheme that uses both an audio context vector computed with a video query and a video context vector computed with an audio query. Furthermore, we combine a connectionist temporal classification (CTC) loss with our attention-based model to enforce the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model, with the DCM attention scheme and the hybrid CTC/attention architecture, achieved a relative improvement in word error rate (WER) of at least 7.3% on average over competing methods based on the transformer model.
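
The two context vectors named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections and a single head, and the names `cross_attention`, `dual_cross_modality`, `hybrid_loss`, and the weight `lam` are hypothetical. The abstract does not state how the CTC and attention losses are weighted, so the weighted sum shown here is only one common choice.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values.

    Assumption for brevity: no learned projections and one head,
    so the keys and the values are the same vectors.
    """
    d = len(keys_values[0])
    contexts = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, keys_values))
                   for j in range(d)]
        contexts.append(context)
    return contexts

def dual_cross_modality(audio, video):
    """Return (audio context from video queries, video context from audio queries)."""
    audio_ctx = cross_attention(video, audio)  # video query -> audio context
    video_ctx = cross_attention(audio, video)  # audio query -> video context
    return audio_ctx, video_ctx

def hybrid_loss(ctc_loss, attention_loss, lam=0.5):
    """Weighted sum of the two losses; lam is a hypothetical interpolation weight."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# Toy features: 3 audio frames and 2 video frames, each of dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.5, 0.5], [1.0, 0.0]]
audio_ctx, video_ctx = dual_cross_modality(audio, video)
```

In this sketch each context sequence has the length of the querying modality: `audio_ctx` holds one audio summary per video frame, and `video_ctx` holds one video summary per audio frame.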

Funder

National Research Foundation of Korea

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/10/20/7263/pdf

Cited by 13 articles.

1. Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems;Expert Systems with Applications;2024-10

2. Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder;2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW);2024-07-15

3. A classification method of marine mammal calls based on two-channel fusion network;Applied Intelligence;2024-02

4. Lip Segmentation for Visual Speech Recognition Based on the Convolution Process;2023 International Conference on Engineering Applied and Nano Sciences (ICEANS);2023-10-25

5. Beyond Conversational Discourse: A Framework for Collaborative Dialogue Analysis;Proceedings of the 7th International Conference on Computer Science and Application Engineering;2023-10-17