Affiliation:
1. Peking University, China
2. University College London, UK
3. Huazhong University of Science and Technology, China
Abstract
Deep learning (DL) has become a key component of modern software. In the "big model" era, the rich features of DL-based software (i.e., DL software) substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on powerful cloud infrastructure with large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training DL models, especially big models, developers need to parallelize and distribute the computation and memory resources across multiple devices (e.g., a cluster of GPUs), a process known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in the distributed training process have not been studied by the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill this knowledge gap and presents the first comprehensive study on developers' issues in distributed training. To this end, we focus on popular DL frameworks that support distributed training (including TensorFlow, PyTorch, Keras, and Horovod) and analyze 1,131 real-world developer issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. We find that: (1) many distributed-specific faults and non-distributed-specific faults share the same fault symptoms, which makes debugging challenging; (2) most fault symptoms have frequent fix patterns; (3) about half of the faults are related to system-level configurations. Based on these results, we suggest actionable implications for research that can facilitate distributed training in developing DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration together with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help with reproduction, and designing serverless APIs for cloud platforms.
Funder
National Natural Science Foundation of China
National Natural Science Fund for the Excellent Young Scientists Fund Program
Center for Data Space Technology and System, Peking University
ERC Advanced Grant
Publisher
Association for Computing Machinery (ACM)