Abstract
The paper addresses the problem of speech recognition using sources of information other than the voice itself, in conditions where the quality or availability of the audio is inadequate (for example, in the presence of noise or additional speakers). This is achieved with automatic lip reading (ALR) methods, which rely on non-acoustic biosignals generated by the human body during speech production. Applications of this approach include medical uses and the processing of voice commands in settings with poor audio conditions. The aim of this work is to create a speech recognition system based on a combination of speaker lip recognition, used as a silent speech interface (SSI), and context prediction. To achieve this goal, the following tasks were performed: substantiating a system for recognizing voice commands through a silent speech interface (SSI) based on a combination of two neural network architectures; implementing a viseme recognition model based on a CNN architecture; and implementing an encoder-decoder architecture for an LSTM recurrent neural network model that analyzes and predicts the context of a speaker’s speech. The developed system was tested on a chosen dataset. Across 7 experiments under different conditions, the average recognition error of the proposed ALR system ranges from 4.34% to 5.12% for character error rate (CER) and from 5.52% to 6.06% for word error rate (WER), which is an advantage over the LipNet project, which additionally processes audio data from the original recordings without noise.
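The abstract reports results as CER and WER but does not say how they were computed. The sketch below assumes the standard definitions: the Levenshtein edit distance between the reference and the recognized text, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER. The function names and the sample command are illustrative only and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical command and recognizer output, for illustration only.
    reference = "place blue at f two now"
    hypothesis = "place blue at f too now"
    print(f"CER = {cer(reference, hypothesis):.2%}")
    print(f"WER = {wer(reference, hypothesis):.2%}")
```

Both rates can exceed 100% when the hypothesis is much longer than the reference, so values averaged over a test set are usually reported together with the number of utterances or experiments, as the abstract does.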
Publisher
National Academy of Sciences of Ukraine (Co. LTD Ukrinformnauka) (Publications)