Lip-Reading Advancements: A 3D Convolutional Neural Network/Long Short-Term Memory Fusion for Precise Word Recognition
Published: 2024-02-04
Volume: 4, Issue: 1
Pages: 410-422
ISSN: 2673-7426
Container title: BioMedInformatics
Language: en
Authors:
Themis Exarchos 1, Georgios N. Dimitrakopoulos 1, Aristidis G. Vrahatis 1, Georgios Chrysovitsiotis 2, Zoi Zachou 2, Efthymios Kyrodimos 2
Affiliation:
1. Department of Informatics, Ionian University, 49100 Corfu, Greece
2. 1st Otorhinolaryngology Department, National and Kapodistrian University of Athens, 11527 Athens, Greece
Abstract
Lip reading, the art of deciphering spoken words from the visual cues of lip movements, has garnered significant interest for its potential applications in diverse fields, including assistive technologies, human–computer interaction, and security systems. With the rapid advancements in technology and the increasing emphasis on non-verbal communication methods, the significance of lip reading has expanded beyond its traditional boundaries. These technological advancements have led to the generation of large-scale and complex datasets, necessitating the use of cutting-edge deep learning tools that are adept at handling such intricacies. In this study, we propose an innovative approach combining 3D Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to tackle the challenging task of word recognition from lip movements. Our research leverages a meticulously curated dataset, named MobLip, encompassing various speech patterns, speakers, and environmental conditions. The synergy between the spatial information extracted by 3D CNNs and the temporal dynamics captured by LSTMs yields impressive results, achieving an accuracy rate of up to 87.5%, showcasing robustness to lighting variations and speaker diversity. Comparative experiments demonstrate our model’s superiority over existing lip-reading approaches, underlining its potential for real-world deployment. Furthermore, we discuss ethical considerations and propose avenues for future research, such as multimodal integration with audio data and expanded language support. In conclusion, our 3D CNN-LSTM architecture presents a promising solution to the complex problem of word recognition from lip movements, contributing to the advancement of communication technology and opening doors to innovative applications in an increasingly visual world.
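The abstract describes the overall architecture, a 3D CNN front-end that extracts spatio-temporal features from the lip-region video, followed by an LSTM that models temporal dynamics and a classifier over the word vocabulary, but does not publish the exact configuration. The following is a minimal PyTorch sketch of that 3D CNN-LSTM pattern; all specifics (channel counts, kernel sizes, the grayscale 64x64 mouth-crop input, the LSTM hidden size, and the 500-word vocabulary) are illustrative assumptions, not the authors' settings.

# Minimal illustrative sketch of a 3D CNN + LSTM lip-reading classifier.
# All hyperparameters below are assumptions for demonstration only.
import torch
import torch.nn as nn

class Lip3DCNNLSTM(nn.Module):
    def __init__(self, num_words=500, lstm_hidden=256):
        super().__init__()
        # 3D convolutions extract spatio-temporal features from a
        # grayscale mouth-region clip shaped (N, 1, T, H, W).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.pool = nn.AdaptiveAvgPool3d((None, 4, 4))  # keep T, fix space to 4x4
        # The LSTM models the temporal dynamics of the per-frame features.
        self.lstm = nn.LSTM(64 * 4 * 4, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_words)

    def forward(self, clip):
        # clip: (N, 1, T, H, W) normalized grayscale frames
        feats = self.pool(self.frontend(clip))           # (N, 64, T, 4, 4)
        feats = feats.permute(0, 2, 1, 3, 4).flatten(2)  # (N, T, 64*4*4)
        out, _ = self.lstm(feats)                        # (N, T, lstm_hidden)
        return self.classifier(out[:, -1])               # logits over the vocabulary

# Usage with a dummy 29-frame, 64x64 clip:
model = Lip3DCNNLSTM()
logits = model(torch.randn(2, 1, 29, 64, 64))  # -> shape (2, 500)

The last LSTM time step summarizes the whole utterance before classification; a pooled average over all time steps would be an equally plausible alternative.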
Funder
Hellenic Foundation for Research and Innovation
Cited by 2 articles.
1. Script Generation for Silent Speech in E-Learning; Advances in Educational Technologies and Instructional Design; 2024-06-03
2. HNet: A deep learning based hybrid network for speaker dependent visual speech recognition; International Journal of Hybrid Intelligent Systems; 2024-06-03