1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C., 2021. ViViT: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846.
2. Bashivan, P., Rish, I., Yeasin, M., Codella, N., 2015. Learning representations from EEG with deep recurrent-convolutional neural networks. arXiv preprint arXiv:1511.06448.
3. Bertasius, G., Wang, H., Torresani, L., 2021. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095.
4. Brown et al., 2020. Language models are few-shot learners. Adv. Neural Inform. Process. Syst.
5. Brunelli et al., 1999. A survey on the automatic indexing of video data. J. Visual Commun. Image Represent.