HNet: A deep learning based hybrid network for speaker dependent visual speech recognition

Author:

Chandrabanshi Vishnu, Domnic S.

Abstract

Visual Speech Recognition (VSR) is a popular area of computer vision research, attracting interest for its ability to precisely analyze lip motion and convert it into textual representation. VSR systems leverage visual features to augment automated speech understanding and predict text. VSR has a range of applications, including enhancing speech recognition when acoustic signals are degraded, aiding individuals with hearing impairments, bolstering security by reducing reliance on text-based passwords, facilitating biometric authentication for liveness detection, and enabling underwater communication. Despite the various techniques proposed for improving the resilience and precision of automatic speech recognition, VSR still faces challenges such as homophones, gradient problems arising from varying sequence lengths, and the need to account for both short- and long-range correlations between consecutive video frames. We propose a hybrid network (HNet) built on a multilayered three-dimensional dilated convolutional neural network (3D-CNN). The dilated 3D-CNN performs spatio-temporal feature extraction. HNet integrates two bidirectional recurrent neural networks (BiGRU and BiLSTM) that process the feature sequences in both directions to establish temporal relationships. Fusing BiGRU and BiLSTM capabilities allows the model to process feature sequences more comprehensively and effectively. The proposed work focuses on face-based biometric authentication for liveness detection using the VSR model to strengthen security against face spoofing. Existing face-based biometric systems are widely used for individual authentication and verification but remain vulnerable to 3D-mask and adversarial attacks. The VSR system can be added to existing face-based verification systems as a second-level authentication technique to verify a person's liveness.
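The pipeline described above (a dilated 3D-CNN front end feeding stacked bidirectional recurrent layers) can be sketched in PyTorch as follows. This is a minimal illustrative sketch only: all layer widths, kernel sizes, pooling choices, and the vocabulary size are assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class HNetSketch(nn.Module):
    """Illustrative HNet-style model: dilated 3D-CNN -> BiGRU -> BiLSTM.

    All sizes below are hypothetical; the original paper's configuration
    may differ substantially.
    """

    def __init__(self, vocab_size=28, feat=64):
        super().__init__()
        # Dilated 3D convolutions extract spatio-temporal lip features.
        # padding=2 with dilation=2 keeps the temporal length unchanged.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),  # downsample space, keep time axis
            nn.Conv3d(32, feat, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time
        )
        # Two bidirectional RNNs model short- and long-range dependencies.
        self.bigru = nn.GRU(feat, 128, bidirectional=True, batch_first=True)
        self.bilstm = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, vocab_size)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        f = self.frontend(x)                            # (B, feat, T, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, feat)
        g, _ = self.bigru(f)                            # (B, T, 256)
        h, _ = self.bilstm(g)                           # (B, T, 256)
        return self.fc(h)                               # per-frame logits
```

In such architectures the per-frame logits are typically decoded into a word sequence with a CTC-style loss and decoder, though the specific decoding scheme here is an assumption.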
The VSR system follows a challenge-response scheme: the user silently pronounces a passcode displayed on the screen. Effectiveness is assessed using word error rate (WER), which measures how closely the pronounced passcode matches the one displayed. Overall, the proposed work aims to improve VSR accuracy so that it can be combined with existing face-based authentication systems. The proposed system outperforms existing VSR systems, achieving a WER of 1.3%. The significance of the proposed hybrid model is that it captures temporal dependencies efficiently, enhances context embedding, improves robustness to input variability, reduces information loss, and improves performance and accuracy in modeling and analyzing passcode pronunciation patterns.
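WER, the metric quoted above, is the word-level Levenshtein edit distance (substitutions, deletions, insertions) between the recognized and reference passcodes, divided by the reference length. A minimal pure-Python sketch (the passcode strings in the usage note are made-up examples, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = min edits turning the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)
```

For a four-word passcode, one substituted word gives `wer("open the door now", "open a door now") == 0.25`, i.e. 25% WER.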

Publisher

IOS Press

