1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprint arXiv:1609.08675.
2. Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In ICCV.
3. Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. In NeurIPS.
4. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., & Schmid, C. (2021). ViViT: A video vision transformer. In ICCV.
5. Avila, S., Thome, N., Cord, M., Valle, E., & de A. Araújo, A. (2013). Pooling in image representation: The visual codeword point of view. Computer Vision and Image Understanding.