1. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
2. Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
3. Wang, C., Li, M., Smola, A.J.: Language models with transformers. arXiv preprint arXiv:1904.09408 (2019)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Allen Institute for AI: Masked language modeling demo, AllenNLP. https://demo.allennlp.org/