Abstract
Environmental noise can hinder spoken communication, motivating alternatives such as lip reading, which does not depend on the audio channel. However, accurate interpretation of lip movements is complicated by the variation in individual lip shapes, which calls for detailed analysis. In addition, existing lip‐reading datasets are predominantly in English; developing an Indonesian dataset addresses this lack of diversity and fosters more inclusive research. This study proposes an enhanced lip‐reading system trained using the long‐term recurrent convolutional network (LRCN) and considering eight different types of lip shapes. MediaPipe Face Mesh detects the lip landmarks, enabling the LRCN model to recognize Indonesian utterances. Experimental results demonstrate the effectiveness of the approach: the LRCN model with three convolutional layers (LRCN‐3Conv) achieves 95.42% accuracy on word test data and 95.63% on phrase test data, outperforming the convolutional long short‐term memory (Conv‐LSTM) method. Furthermore, in an evaluation on the original MIRACL‐VC1 dataset, LRCN‐3Conv achieved the best accuracy, 90.67%, in the word‐labeled class when compared with previous studies. This success is attributed to MediaPipe Face Mesh, which facilitates accurate detection of the lip region. By combining deep learning with precise landmark detection, the proposed approach promises improved communication accessibility for individuals facing auditory challenges.
Funder
Kementerian Pendidikan, Kebudayaan, Riset, dan Teknologi