1. ViViT: A video vision transformer
2. BEiT: BERT pre-training of image transformers;Bao
3. Is space-time attention all you need for video understanding?;Bertasius;ICML,2021
4. Generative pretraining from pixels;Chen
5. A simple framework for contrastive learning of visual representations;Chen