1. A. Vaswani et al., "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 2017.
2. J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
3. M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism," CoRR, 2019.
4. T. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020.
5. J. Rae et al., "Scaling language models: Methods, analysis & insights from training Gopher," CoRR, 2021.