Taming unbalanced training workloads in deep learning with partial collective operations

Published: 2020-02-19
Container-title: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Author:

Li Shigang¹, Ben-Nun Tal¹, Di Girolamo Salvatore¹, Alistarh Dan², Hoefler Torsten¹

Affiliation:

1. ETH Zurich

2. IST Austria

Funder

European Research Council (ERC) under the European Union's Horizon 2020 programme, grant agreement DAPP,

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3332466.3374528

References: 62 articles.

1. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.

2. Dario Amodei and Danny Hernandez. 2018. AI and Compute. https://openai.com/blog/ai-and-compute/.

3. A. Awan, K. Hamidouche, J. Hashmi, and D. Panda. 2017. S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.

Cited by 31 articles.

1. Canary: Congestion-aware in-network allreduce using dynamic trees;Future Generation Computer Systems;2024-03

2. Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey;IEEE Transactions on Parallel and Distributed Systems;2023-12

3. Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11

4. ADA-GP: Accelerating DNN Training By Adaptive Gradient Prediction;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28

5. HPCClusterScape: Increasing Transparency and Efficiency of Shared High-Performance Computing Clusters for Large-scale AI Models;2023 IEEE Visualization in Data Science (VDS);2023-10-15