1. Attention is all you need;vaswani;Proc Adv Neural Inf Process Syst,2017
2. TeraPipe: Token-level pipeline parallelism for training large-scale language models;li;Proc 38th Int Conf Mach Learn,2021
3. Exploring hidden dimensions in accelerating convolutional neural networks;jia;Proc 35th Int Conf Mach Learn,2018
4. Sequence parallelism: Making 4D parallelism possible;li,2021
5. End-to-end adaptive distributed training on PaddlePaddle;ao,2021