Author:
Zhang Xiaobing, Gong Haigang, Dai Xili, Yang Fan, Liu Nianbo, Liu Ming
Abstract
With the breakthrough of deep learning, lip reading technologies are progressing extraordinarily rapidly. It is well known that Chinese is the most widely spoken language in the world. Unlike alphabetic languages, it involves more than 1,000 pronunciations as Pinyin and nearly 90,000 pictographic characters as Hanzi, which makes lip reading of Chinese very challenging. In this paper, we implement visual-only Chinese lip reading of unconstrained sentences in a two-step end-to-end architecture (LipCH-Net), in which two deep neural network models are employed to perform the recognition of Picture-to-Pinyin (mouth motion pictures to pronunciations) and the recognition of Pinyin-to-Hanzi (pronunciations to texts) respectively, before being jointly optimized to improve the overall performance. In addition, two modules in the Pinyin-to-Hanzi model are pre-trained separately with large auxiliary data in advance of sequence-to-sequence training, to make the most of long-sequence matches and avoid ambiguity. We collect six months of daily news broadcasts from the China Central Television (CCTV) website, and semi-automatically label them into a 20.95 GB dataset with 20,495 natural Chinese sentences. When trained on the CCTV dataset, the LipCH-Net model outperforms all state-of-the-art lip reading frameworks. According to the results, our scheme not only accelerates training and reduces overfitting, but also overcomes the syntactic ambiguity of Chinese, providing a baseline for future related work.
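The Pinyin-to-Hanzi ambiguity mentioned in the abstract, and the reason matching longer sequences helps, can be illustrated with a toy sketch. This is not the LipCH-Net model: a small hand-written lookup table stands in for the neural Pinyin-to-Hanzi network, and the names `LEXICON`, `candidates`, and `pinyin_to_hanzi` are hypothetical. Tones are dropped here for brevity, which is also an assumption of the sketch rather than something stated in the abstract.

```python
# Toy illustration only: a lookup table replaces the paper's neural
# Pinyin-to-Hanzi model. All names and entries below are hypothetical.

# Mini-lexicon: toneless Pinyin syllable tuples -> candidate Hanzi strings.
LEXICON = {
    ("shi",): ["是", "十", "事"],   # one syllable, several candidate characters
    ("shi", "jie"): ["世界"],       # two syllables together have one candidate here
    ("ni",): ["你"],
    ("hao",): ["好"],
    ("ni", "hao"): ["你好"],
}


def candidates(syllables):
    """Return every Hanzi candidate for an exact syllable sequence."""
    return LEXICON.get(tuple(syllables), [])


def pinyin_to_hanzi(syllables):
    """Greedy longest-match decoding over the toy lexicon.

    Longer matches are tried first, so a multi-syllable entry wins over
    the ambiguous single-syllable entries it contains.
    """
    out = []
    i = 0
    max_len = max(len(key) for key in LEXICON)
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cands = candidates(syllables[i:i + n])
            if cands:
                out.append(cands[0])  # take the first listed candidate
                i += n
                break
        else:
            out.append("?")  # syllable not covered by the toy lexicon
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(candidates(["shi"]))
    print(pinyin_to_hanzi(["ni", "hao", "shi", "jie"]))
```

In this sketch the isolated syllable `shi` has three candidates, while the pair `shi jie` has only one, which is the intuition behind preferring long-sequence matches. A learned sequence-to-sequence model, as described in the abstract, would replace both the table and the greedy rule.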
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
13 articles.
1. MKTN: Adversarial-Based Multifarious Knowledge Transfer Network from Complementary Teachers;International Journal of Computational Intelligence Systems;2024-03-28
2. Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach;Multimedia Systems;2024-01-25
3. Silent Speech Interface Using Lip-Reading Methods;Communications in Computer and Information Science;2024
4. LipFormer: Learning to Lipread Unseen Speakers Based on Visual-Landmark Transformers;IEEE Transactions on Circuits and Systems for Video Technology;2023-09
5. Lip Reading Bengali Words;Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence;2022-12-23