Authors:
Tran Thanh-Hai, Do Vuong-Loc
Publisher:
Springer Nature Singapore