Author:
Jia Junwei, Wang Zhilu, Xu Lianghui, Dai Jiajia, Gu Mingyi, Huang Jing
Abstract
Lip movements carry essential linguistic information and are an important medium for studying the content of a dialogue. Many studies have sought to improve the accuracy of lip recognition models, but few have examined their robustness and generalization under various disturbances. Our experiments show that the accuracy of the current state-of-the-art lip recognition model drops significantly when its input is disturbed, and that the model is particularly sensitive to adversarial examples. This paper substantially alleviates the problem by using Mixup training. For example, under adversarial attacks generated by FGSM, the model in this paper achieves 85.0% and 40.2% accuracy on the English dataset LRW and the Mandarin dataset LRW-1000, respectively, which is 9.8% and 8.3% higher than current advanced lip recognition models. These results demonstrate the positive impact of Mixup training on the robustness and generalization of lip recognition models. In addition, the performance of lip recognition classification models depends heavily on the number of model parameters, which increases the computational cost. The InvNet-18 network in this paper reduces GPU resource consumption and training time while improving model accuracy. Compared with the standard ResNet-18 network used in mainstream lip recognition models, InvNet-18 consumes less than one third of the GPU resources and has 32% fewer parameters. Detailed analysis and comparison show that the model in this paper effectively improves resistance to interference and reduces training resource consumption, while its accuracy remains comparable to current state-of-the-art results.
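The two techniques named in the abstract can be illustrated with a minimal sketch. It assumes the standard formulations: Mixup blends two training examples and their one-hot labels with a weight drawn from Beta(alpha, alpha), and FGSM shifts each input value by epsilon in the direction of the sign of the loss gradient. The function names, the values of `alpha` and `epsilon`, and the flat-list inputs are illustrative assumptions; the abstract does not state the settings or implementation the authors used.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Blend two examples and their one-hot labels (alpha is an assumed value)."""
    # Mixing weight drawn from Beta(alpha, alpha), always in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


def fgsm(x, grad, epsilon=0.1):
    """One FGSM step given the loss gradient with respect to the input.

    The gradient is passed in directly so the sketch stays model-agnostic;
    epsilon is an assumed perturbation budget.
    """
    def sign(g):
        return (g > 0) - (g < 0)

    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]


if __name__ == "__main__":
    random.seed(0)
    x_mix, y_mix, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
    print("lambda:", lam, "mixed input:", x_mix, "mixed label:", y_mix)
    print("adversarial input:", fgsm([0.5, 0.5, 0.5], [2.0, -3.0, 0.0]))
```

In a real training loop the mixed pair would replace the original batch entries, and the gradient passed to `fgsm` would come from backpropagating the loss to the input frames.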
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering