1. Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. NeurIPS (2021).
2. Jean-Baptiste Alayrac et al. 2020. Self-supervised multimodal versatile networks. NeurIPS (2020).
3. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2020. Self-supervised learning by cross-modal audio-video clustering. NeurIPS (2020).
4. Relja Arandjelovic and Andrew Zisserman. 2017. Look listen and learn. In ICCV.
5. Relja Arandjelovic and Andrew Zisserman. 2018. Objects that sound. In ECCV. 435--451.