Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition

Published: 2022-07-23 Issue: 15 Volume: 22 Page: 5501
ISSN: 1424-8220
Container-title: Sensors
Language: en

Author:

Yu Wentao, Zeiler Steffen, Kolossa Dorothea

Abstract

Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture—the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word error rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.

Funder

Deutsche Forschungsgemeinschaft

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/15/5501/pdf

References (55 articles)

1. Eye Can Hear Clearly Now: Inverse Effectiveness in Natural Audiovisual Speech Processing Relies on Long-Term Crossmodal Temporal Integration

2. Hearing lips and seeing voices

3. Audio-Visual Automatic Speech Recognition: An Overview. Issues in Visual and Audio-Visual Speech Processing; Potamianos, 2004

4. Improving speaker-independent lipreading with domain-adversarial training; Wand; arXiv, 2017

5. Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates; Meutzner; Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017

Cited by 1 article.

1. Human-inspired computational models for European Portuguese: a review; Language Resources and Evaluation; 2023-05-03