1. Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv:1609.08675 [cs.CV]
2. Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. 2020. Self-supervised Learning of Audio-Visual Objects from Video. LNCS, Vol. 12363. Springer International Publishing, 208–224.
3. Unaiza Ahsan, Rishi Madhok, and Irfan Essa. 2019. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 179–189.
4. Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Advances in Neural Information Processing Systems (NeurIPS).
5. Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 25–37.