1. Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
2. T. Afouras, J. S. Chung, and A. Zisserman. 2018. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496.
3. ASR is All You Need: Cross-Modal Distillation for Lip Reading
4. Alexei Baevski et al. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems (2020).
5. Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2022. BEiT: BERT Pre-Training of Image Transformers. In ICLR 2022.