Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective

Authors:

Liu Xuanzhe¹, Gu Diandian¹, Chen Zhenpeng², Wen Jinfeng¹, Zhang Zili¹, Ma Yun¹, Wang Haoyu³, Jin Xin¹

Affiliations:

1. Peking University, China

2. University College London, UK

3. Huazhong University of Science and Technology, China

Abstract

Deep learning (DL) has become a key component of modern software. In the “big model” era, the rich features of DL-based software (i.e., DL software) rely substantially on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained in the cloud on large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training DL models, especially big models, developers need to parallelize and distribute the computation and memory resources amongst multiple devices (e.g., a cluster of GPUs), a process known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in the distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill this knowledge gap and presents the first comprehensive study of developers’ issues in distributed training. To this end, we focus on popular DL frameworks that support distributed training (including TensorFlow, PyTorch, Keras, and Horovod) and analyze 1,131 real-world developer issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories of fault symptoms and summarize common fix patterns for different symptoms. We find that: (1) many distributed-specific faults and non-distributed-specific faults inherently share the same fault symptoms, making them challenging to debug; (2) most of the fault symptoms have frequent fix patterns; (3) about half of the faults are related to system-level configurations.
Based on these results, we suggest actionable implications for research avenues that can facilitate distributed training in the development of DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to aid reproduction, and designing serverless APIs for cloud platforms.
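To make the synchronization step concrete, the sketch below simulates synchronous data-parallel training without any DL framework: each "worker" computes a gradient on its own data shard, an all-reduce averages the gradients across workers, and every replica applies the same update. This is an illustrative toy model only (the worker loop, shard layout, and `allreduce_mean` helper are assumptions for exposition); real systems perform these steps in parallel processes via e.g. PyTorch DDP or Horovod.

```python
# Toy synchronous data-parallel training: fit y = 2x with linear regression,
# two simulated workers each holding half of the data.

def grad_mse(w, shard):
    # Gradient of mean((w*x - y)^2) over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # Stand-in for a collective all-reduce: average across workers.
    return sum(values) / len(values)

def train_step(w, shards, lr=0.1):
    local_grads = [grad_mse(w, s) for s in shards]  # computed in parallel in real systems
    g = allreduce_mean(local_grads)                 # the synchronization point
    return w - lr * g                               # identical update on every replica

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
# w converges to 2.0, the true slope
```

Because every replica sees the same averaged gradient, all model copies stay bit-identical after each step; faults in this synchronization path (e.g. misconfigured sockets or mismatched world sizes) are exactly the communication-configuration issues the study highlights.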

Funder

National Natural Science Foundation of China

National Natural Science Fund for the Excellent Young Scientists Fund Program

Center for Data Space Technology and System, Peking University

ERC Advanced Grant

Publisher

Association for Computing Machinery (ACM)

Subject

Software

References: 105 articles.

