Funder
National Natural Science Foundation of China
Publisher
Springer Science and Business Media LLC
Reference82 articles.
1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). Vivit: A video vision transformer. In Proceedings of the ieee/cvf international conference on computer vision (pp. 6836–6846).
2. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? Icml (Vol. 2, p. 4).
3. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In proceedings of the ieee conference on computer vision and pattern recognition (pp. 6299–6308).
4. Chen, J., & Ho, C. M. (2022). Mm-vit: Multi-modal video transformer for compressed video action recognition. In Proceedings of the ieee/cvf winter conference on applications of computer vision (pp. 1910–1921).
5. Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2019). On the relationship between self-attention and convolutional layers. arXiv:1911.03584