Bagua

Published:2021-12 Issue:4 Volume:15 Page:804-813
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Gan Shaoduo¹, Jiang Jiawei¹, Yuan Binhang¹, Zhang Ce¹, Lian Xiangru², Wang Rui², Chang Jianbin², Liu Chengjun², Shi Hongmei², Zhang Shengzhuo², Li Xianghong², Sun Tengxu², Yang Sen², Liu Ji²

Affiliation:

1. ETH Zürich, Switzerland

2. Kuaishou Technology, China

Abstract

Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to reduce communication via "system relaxations": quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all the optimizations that the machine learning community has recently developed. Given this emerging gap between the current landscapes of systems and theory, we build Bagua, an MPI-style communication library that provides a collection of primitives and is both flexible and modular enough to support state-of-the-art system relaxation techniques for distributed training. Powered by this design, Bagua can readily implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), Bagua can outperform PyTorch-DDP, Horovod and BytePS in end-to-end training time by a significant margin (up to 2X) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3503585.3503590

References: 82 articles.

1. [n.d.]. Apex. https://nvidia.github.io/apex/optimizers.html.

2. [n.d.]. NCCL. https://developer.nvidia.com/nccl.

3. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265-283.

4. Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2016. QSGD: Communication-efficient SGD via gradient quantization and encoding. arXiv preprint arXiv:1610.02132 (2016).

5. Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. 2018. The convergence of sparsified gradient methods. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 5977-5987.

Cited by 16 articles.

1. A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training;IEEE Transactions on Parallel and Distributed Systems;2024-08

2. Gsyn: Reducing Staleness and Communication Waiting via Grouping-based Synchronization for Distributed Deep Learning;IEEE INFOCOM 2024 - IEEE Conference on Computer Communications;2024-05-20

3. Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications;Proceedings of the Nineteenth European Conference on Computer Systems;2024-04-22

4. Decentralized bilevel optimization;Optimization Letters;2024-03-26

5. A Novel Federated Learning Framework Based on Conditional Generative Adversarial Networks for Privacy Preserving in 6G;Electronics;2024-02-16