Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective

Authors:

Liu Xuanzhe¹, Gu Diandian¹, Chen Zhenpeng², Wen Jinfeng¹, Zhang Zili¹, Ma Yun¹, Wang Haoyu³, Jin Xin¹

Affiliations:

1. Peking University, China

2. University College London, UK

3. Huazhong University of Science and Technology, China

Abstract

Deep learning (DL) has become a key component of modern software. In the “big model” era, the rich features of DL-based software (i.e., DL software) rely substantially on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained in the cloud on large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training DL models, especially big models, developers need to parallelize and distribute the computation and memory resources amongst multiple devices (e.g., a cluster of GPUs), a process known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in the distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill this knowledge gap and presents the first comprehensive study of developers’ issues in distributed training. To this end, we focus on popular DL frameworks that support distributed training (including TensorFlow, PyTorch, Keras, and Horovod) and analyze 1,131 real-world developer issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. We find that: (1) many distributed-specific faults and non-distributed-specific faults inherently share the same fault symptoms, making them challenging to debug; (2) most of the fault symptoms have frequent fix patterns; (3) about half of the faults are related to system-level configurations.
Based on these results, we suggest actionable implications for research avenues that can facilitate distributed training in the development of DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to aid reproduction, and designing serverless APIs for cloud platforms.
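To make the synchronization step concrete, the sketch below simulates synchronous data-parallel training without any DL framework: each "worker" computes a gradient on its own data shard, an all-reduce averages the gradients across workers, and every replica applies the same update. This is an illustrative toy model only (the worker loop, shard layout, and `allreduce_mean` helper are assumptions for exposition); real systems perform these steps in parallel processes via e.g. PyTorch DDP or Horovod.

```python
# Toy synchronous data-parallel training: fit y = 2x with linear regression,
# two simulated workers each holding half of the data.

def grad_mse(w, shard):
    # Gradient of mean((w*x - y)^2) over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # Stand-in for a collective all-reduce: average across workers.
    return sum(values) / len(values)

def train_step(w, shards, lr=0.1):
    local_grads = [grad_mse(w, s) for s in shards]  # computed in parallel in real systems
    g = allreduce_mean(local_grads)                 # the synchronization point
    return w - lr * g                               # identical update on every replica

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
# w converges to 2.0, the true slope
```

Because every replica sees the same averaged gradient, all model copies stay bit-identical after each step; faults in this synchronization path (e.g. misconfigured sockets or mismatched world sizes) are exactly the communication-configuration issues the study highlights.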

Funder

National Natural Science Foundation of China

National Natural Science Fund for the Excellent Young Scientists Fund Program

Center for Data Space Technology and System, Peking University

ERC Advanced Grant

Publisher

Association for Computing Machinery (ACM)

Subject

Software

References: 105 articles.

