1. XLNet: Generalized autoregressive pretraining for language understanding;Yang;NeurIPS,2019
2. BERT: Pre-training of deep bidirectional transformers for language understanding;Devlin;NAACL,2019
3. RoBERTa: A robustly optimized BERT pretraining approach;Liu;arXiv preprint arXiv:1907.11692,2019
4. ELECTRA: Pre-training text encoders as discriminators rather than generators;Clark;ICLR,2020
5. An image is worth 16x16 words: Transformers for image recognition at scale;Dosovitskiy;ICLR,2021