1. ViViT: A video vision transformer;Arnab
2. BEiT: BERT pre-training of image transformers;Bao
3. Language models are few-shot learners;Brown
4. Generative pre-training from pixels;Chen
5. A simple framework for contrastive learning of visual representations;Chen