
Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique

Published:2021-04-11 Issue:2 Volume:11 Page:6986-6992
ISSN:1792-8036
Container-title:Engineering, Technology & Applied Science Research
Short-container-title:Eng. Technol. Appl. Sci. Res.

Author:

Poomhiran L., Meesad P., Nuanmeesri S.

Abstract

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), which consists of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), the frame at the end of the pronunciation of a syllable. Each lip-area image was reduced to 32×32 pixels per frame, and three keyframes concatenated together were used to represent one syllable as a 96×32-pixel image for visual speech recognition. The three keyframes representing a syllable are selected based on the relative maxima and relative minima of the open lip’s width and height. On the THDigits dataset, the model achieved accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively. The C3-SKI technique was also applied to the AVDigits dataset, achieving 85.62% accuracy. In conclusion, the C3-SKI technique could be applied to perform lip reading recognition.
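The concatenation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `relative_extrema` and `concat_keyframes` are hypothetical, the frames are assumed to be grayscale, and 96×32 is read as width × height, so the three keyframes are placed side by side. The abstract does not say how the extrema of lip width and height map to the three keyframes, so the sketch only locates relative maxima and minima in a per-frame lip-opening signal and leaves that choice open.

```python
import numpy as np

FRAME_SIZE = 32  # each lip keyframe is 32x32 pixels, per the abstract


def relative_extrema(signal):
    """Return indices of strict relative maxima and minima of a 1-D signal.

    Hypothetical helper: `signal` could be the lip opening width or height
    measured in each video frame of one syllable.
    """
    s = np.asarray(signal, dtype=float)
    inner = np.arange(1, len(s) - 1)  # endpoints cannot be interior extrema
    maxima = inner[(s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])]
    minima = inner[(s[1:-1] < s[:-2]) & (s[1:-1] < s[2:])]
    return maxima, minima


def concat_keyframes(sli, mli, eli):
    """Place three 32x32 keyframes side by side (assumed horizontal layout)."""
    frames = [np.asarray(f) for f in (sli, mli, eli)]
    for f in frames:
        if f.shape != (FRAME_SIZE, FRAME_SIZE):
            raise ValueError(f"expected a {FRAME_SIZE}x{FRAME_SIZE} frame, got {f.shape}")
    # Array shape is (rows, cols) = (32, 96), i.e. 96 pixels wide, 32 high.
    return np.hstack(frames)


# Usage with synthetic data standing in for real lip crops.
opening = [0, 2, 5, 3, 1, 4, 0]
maxima, minima = relative_extrema(opening)
image = concat_keyframes(
    np.zeros((32, 32)), np.ones((32, 32)), np.full((32, 32), 2.0)
)
print(maxima, minima, image.shape)
```

The resulting array would be the single-image input to a convolutional network, one image per syllable.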

Publisher

Engineering, Technology & Applied Science Research

References: 41 articles.

1. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778. https://doi.org/10.1109/CVPR.2016.90

2. S. Fenghour, D. Chen, and P. Xiao, "Decoder-encoder LSTM for lip reading," in Proceedings of the 2019 8th International Conference on Software and Information Engineering, Cairo, Egypt, Apr. 9-12, 2019, pp. 162-166. https://doi.org/10.1145/3328833.3328845

3. S. Petridis, Z. Li, and M. Pantic, "End-to-end visual speech recognition with LSTMs," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, New Orleans, LA, USA, Mar. 5-9, 2017, pp. 2592-2596. https://doi.org/10.1109/ICASSP.2017.7952625

4. S. Chung, J. S. Chung, and H. Kang, "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, May 12-17, 2019, pp. 3965-3969. https://doi.org/10.1109/ICASSP.2019.8682524

5. R. Bi and M. Swerts, "A perceptual study of how rapidly and accurately audiovisual cues to utterance-final boundaries can be interpreted in Chinese and English," Speech Communication, vol. 95, pp. 68-77, 2017. https://doi.org/10.1016/j.specom.2017.07.002

Cited by 7 articles.

1. Temporal Keyframe Technique based on CNN and LSTM for Enhancing Lip Reading Performance;2024 12th International Electrical Engineering Congress (iEECON);2024-03-06

2. Enhancing Human Motion Prediction through Joint-based Analysis and AVI Video Conversion;Mobile Networks and Applications;2023-11-03

3. Enhancing Human Motion Prediction through Joint-based Analysis and AVI Video Conversion;2023-06-29

4. Survey on Visual Speech Recognition using Deep Learning Techniques;2023 International Conference on Communication System, Computing and IT Applications (CSCITA);2023-03-31

5. Augmenting machine learning for Amharic speech recognition: a paradigm of patient’s lips motion detection;Multimedia Tools and Applications;2022-03-19