1. BERT: Pre-training of deep bidirectional transformers for language understanding;Devlin
2. Language models are few-shot learners;Brown;Advances in neural information processing systems,2020
3. An image is worth 16x16 words: Transformers for image recognition at scale;Dosovitskiy;ICLR,2021
4. ViViT: A Video Vision Transformer