1. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017:5998–6008.
2. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. 2019:4171–4186.
3. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving Language Understanding by Generative Pre-Training. OpenAI. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
4. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language Models are Unsupervised Multitask Learners. OpenAI. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
5. Brown TB, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165. 2020. https://arxiv.org/abs/2005.14165