On the Reproducibility and Replicability of Deep Learning in Software Engineering

Author:

Liu Chao1, Gao Cuiyun2, Xia Xin3, Lo David4, Grundy John5, Yang Xiaohu6

Affiliation:

1. Zhejiang University, Hangzhou, Zhejiang, China, and Chongqing University, Hangzhou, Chongqing, China

2. Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China

3. Huawei, Hangzhou, Zhejiang, China

4. Singapore Management University, Singapore

5. Monash University, Clayton, Victoria, Australia

6. Zhejiang University, Hangzhou, Zhejiang, China, and PengCheng Laboratory, Shenzhen, Guangdong, China

Abstract

Context: Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years, because they can often solve many SE challenges without enormous manual feature engineering effort or complex domain knowledge.

Objective: Although many DL studies have reported substantial advantages over other state-of-the-art models in effectiveness, they often ignore two factors: (1) reproducibility: whether the reported experimental results can be obtained by other researchers using the authors' artifacts (i.e., source code and datasets) with the same experimental setup; and (2) replicability: whether the reported experimental results can be obtained by other researchers using their re-implemented artifacts with a different experimental setup. We observed that DL studies commonly overlook these two factors and declare them as minor threats or leave them for future work. This is mainly due to high model complexity, with many manually set parameters and a time-consuming optimization process, unlike classical supervised machine learning (ML) methods (e.g., random forest). This study aims to investigate the urgency and importance of reproducibility and replicability for DL studies on SE tasks.

Method: We conducted a literature review of 147 DL studies recently published in 20 SE venues and 20 AI (artificial intelligence) venues to investigate these issues. We also re-ran four representative DL models in SE to investigate important factors that may strongly affect the reproducibility and replicability of a study.

Results: Our statistics show the urgency of investigating these two factors in SE: only 10.2% of the studies investigate any research question to show that their models can address at least one issue of replicability and/or reproducibility. More than 62.6% of the studies do not even share high-quality source code or complete data to support the reproducibility of their complex models. Meanwhile, our experimental results show the importance of reproducibility and replicability: the reported performance of a DL model could not be reproduced because of an unstable optimization process. Replicability could be substantially compromised if the model training is not convergent, or if performance is sensitive to the size of the vocabulary and testing data.

Conclusion: It is urgent for the SE community to provide a long-lasting link to a high-quality reproduction package, enhance the stability and convergence of DL-based solutions, and avoid performance sensitivity on different sampled data.

Funder

National Science Foundation of China

Key Research and Development Program of Zhejiang Province

National Research Foundation, Singapore

Stable Support Plan for Colleges and Universities in Shenzhen

Publisher

Association for Computing Machinery (ACM)

Subject

Software

Cited by 22 articles.