1. Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, 2019.
2. Clark et al. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
3. Raffel et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.
4. Brown et al. Language models are few-shot learners. In NeurIPS, 2020.