1. Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. NeurIPS (2021).
2. Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelovic, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-Supervised MultiModal Versatile Networks. NeurIPS 2, 6 (2020).
3. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2020. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. In Advances in Neural Information Processing Systems (NeurIPS).
4. Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. 2013. A public domain dataset for human activity recognition using smartphones. In ESANN, Vol. 3.
5. Look, Listen and Learn