1. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
2. NVIDIA. 2020. NVIDIA Mellanox ConnectX-5. https://www.nvidia.com/en-us/networking/ethernet/connectx-5/.
3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
4. William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.