1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6836–6846).
2. Bensch, R., Scherf, N., Huisken, J., Brox, T., & Ronneberger, O. (2017). Spatiotemporal deformable prototypes for motion anomaly detection. International Journal of Computer Vision, 122(3), 502–523.
3. Bertasius, G., Feichtenhofer, C., Tran, D., Shi, J., & Torresani, L. (2018). Learning discriminative motion features through detection. arXiv:1812.04172
4. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? arXiv:2102.05095
5. Bulat, A., Perez Rua, J. M., Sudhakaran, S., Martinez, B., & Tzimiropoulos, G. (2021). Space-time mixing attention for video transformer. Advances in Neural Information Processing Systems, 34.