End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC

Published: 2022-05-09 Issue: 9 Volume: 22 Page: 3597
ISSN: 1424-8220
Container-title: Sensors
Language: en
Short-container-title: Sensors

Author:

Jeon Sanghun, Kim Mun Sang

Abstract

Alongside recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. Although VSR systems must recognize speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal-face images. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encoder uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words. The decoder uses cascaded local self-attention connectionist temporal classification (CTC) to capture local contextual information in the immediate vicinity, which results in a substantial performance boost and fast convergence. In experiments on the OuluVS2 dataset, with the data divided by view into the four perspectives, the proposed model improved on the existing state-of-the-art performance by 3.31% (0°), 4.79% (30°), 5.51% (45°), and 6.18% (60°), or 4.95% on average, and improved on the baseline by 9.1% on average. Thus, the proposed design enhances the performance of multi-view VSR and increases its usefulness in real-world applications.
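
The abstract does not spell out how the local self-attention in the decoder is computed. As a rough illustration only, the sketch below shows one common reading of the idea: each frame attends to frames within a fixed temporal window instead of to the whole sequence. The function name `local_self_attention`, the `window` parameter, and the toy input are assumptions made for this example and are not taken from the paper; learned query/key/value projections, multiple heads, the cascading of layers, and the CTC loss are all omitted.

```python
import math


def local_self_attention(frames, window):
    """Single-head self-attention restricted to a temporal window.

    frames: list of T feature vectors (lists of floats), one per video frame.
    window: hypothetical parameter; frame t attends to frames t-window..t+window.
    Returns (outputs, weights), where weights[t][s] is the attention from t to s.
    """
    T = len(frames)
    d = len(frames[0])
    outputs, all_weights = [], []
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Scaled dot-product scores against neighbouring frames only.
        scores = [
            sum(a * b for a, b in zip(frames[t], frames[s])) / math.sqrt(d)
            for s in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [0.0] * T  # frames outside the window keep weight 0
        for s, e in zip(range(lo, hi), exps):
            weights[s] = e / z
        # Output is the attention-weighted average of the frame features.
        outputs.append(
            [sum(weights[s] * frames[s][k] for s in range(T)) for k in range(d)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: 8 frames with 4-dimensional features (illustrative values only).
frames = [[math.sin(0.5 * t + k) for k in range(4)] for t in range(8)]
out, w = local_self_attention(frames, window=2)
```

With `window=2`, frame 0 places attention only on frames 0 to 2, so each output vector summarizes its immediate temporal neighbourhood rather than the whole sentence.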

Funder

National Research Foundation of Korea

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/9/3597/pdf

References: 77 articles.

1. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

2. Audiovisual automatic speech recognition: Progress and challenges

3. A review of recent advances in visual speech decoding

4. Biometric Liveness Detection: Challenges and Research Opportunities

Cited by 7 articles.

1. Deep hybrid architectures and DenseNet35 in speaker-dependent visual speech recognition; Signal, Image and Video Processing; 2024-05-02

2. Speech recognition in digital videos without audio using convolutional neural networks; Journal of Intelligent & Fuzzy Systems; 2024-03-23

3. Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems; ETRI Journal; 2024-02

4. LipSyncNet: A Novel Deep Learning Approach for Visual Speech Recognition in Audio-Challenged Situations; IEEE Access; 2024

5. Data-Driven Advancements in Lip Motion Analysis: A Review; Electronics; 2023-11-18