1. BERT: Pre-training of deep bidirectional transformers for language understanding;Devlin;arXiv preprint,2018
2. Communication efficient distributed machine learning with the parameter server;Li;Advances in Neural Information Processing Systems,2014
3. Megatron-LM: Training multi-billion parameter language models using model parallelism;Shoeybi;arXiv preprint,2019