1. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4724–4733). Piscataway: IEEE.
2. Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 318–335). Cham: Springer.
3. Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., et al. (2018). AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6047–6056). Piscataway: IEEE.
4. Heilbron, F. C., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). ActivityNet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 961–970). Piscataway: IEEE.
5. Liu, Y., Albanie, S., Nagrani, A., & Zisserman, A. (2019). Use what you have: video retrieval using representations from collaborative experts. arXiv preprint. arXiv:1907.13487.