Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition

Published: 2021-12-23
Volume: 22, Issue: 1, Page: 72
ISSN: 1424-8220
Journal: Sensors
Language: en

Author:

Jeon Sanghun, Elsharkawy Ahmed, Kim Mun Sang

Abstract

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
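The character and word error rates mentioned in the abstract are conventionally computed as the edit (Levenshtein) distance between the reference and the predicted transcript, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function names and the example sentences are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ref: str, hyp: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def character_error_rate(ref: str, hyp: str) -> float:
    """CER: character-level edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one short word is misread in a six-word sentence.
reference = "bin blue at f two now"
prediction = "bin blue at s two now"
print(round(word_error_rate(reference, prediction), 3))       # 0.167 (1 of 6 words)
print(round(character_error_rate(reference, prediction), 3))  # 0.048 (1 of 21 characters)
```

In this example a single confused letter costs one full word at the word level but only one character at the character level, which is why the two rates are usually reported together.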

Funder

National Research Foundation of Korea (NRF) grant funded by the Korea government

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/1/72/pdf

Cited by 18 articles.

1. HNet: A deep learning based hybrid network for speaker dependent visual speech recognition; International Journal of Hybrid Intelligent Systems; 2024-06-03

2. Deep hybrid architectures and DenseNet35 in speaker-dependent visual speech recognition; Signal, Image and Video Processing; 2024-05-02

3. Classification of Landsat 8 Images Using Convolutional Neural Network Based on Minimum Noise Fraction Transform; 2024 35th Conference of Open Innovations Association (FRUCT); 2024-04-24

4. Speech recognition in digital videos without audio using convolutional neural networks; Journal of Intelligent & Fuzzy Systems; 2024-03-23

5. Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems; ETRI Journal; 2024-02