1. Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449--12460.
2. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. PMLR, 1597--1607.
3. Capture, Learning, and Synthesis of 3D Speaking Styles
4. FaceFormer: Speech-Driven 3D Facial Animation with Transformers
5. A 3-D Audio-Visual Corpus of Affective Communication