1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention is All You Need." In Advances in Neural Information Processing Systems (pp. 5998-6008).
2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805.
3. Brown, T., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." In Advances in Neural Information Processing Systems (Vol. 33, pp. 1877-1901).
4. Radford, A., Wu, J., Child, R., et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI blog, 1(8), 9.
5. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 21(140), 1-67.