A fine-grained data set and analysis of tangling in bug fixing commits

Authors:

Herbold Steffen, Trautsch Alexander, Ledel Benjamin, Aghamohammadi Alireza, Ghaleb Taher A., Chahal Kuljit Kaur, Bossenmaier Tim, Nagaria Bhaveet, Makedonski Philip, Ahmadabadi Matin Nili, Szabados Kristof, Spieker Helge, Madeja Matej, Hoy Nathaniel, Lenarduzzi Valentina, Wang Shangwen, Rodríguez-Pérez Gema, Colomo-Palacios Ricardo, Verdecchia Roberto, Singh Paramvir, Qin Yihao, Chakroborti Debasish, Davis Willard, Walunj Vijay, Wu Hongjun, Marcilio Diego, Alam Omar, Aldaeej Abdullah, Amit Idan, Turhan Burak, Eismann Simon, Wickert Anna-Katharina, Malavolta Ivano, Sulír Matúš, Fard Fatemeh, Henley Austin Z., Kourtzanidis Stratos, Tuzun Eray, Treude Christoph, Shamasbi Simin Maleki, Pashchenko Ivan, Wyrich Marvin, Davis James, Serebrenik Alexander, Albrecht Ella, Aktas Ethem Utku, Strüber Daniel, Erbel Johannes

Abstract

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they study not only bugs, but also other concerns that are irrelevant to the study of bugs.

Objective: We want to improve our understanding of the prevalence of tangling and of the types of changes that are tangled within bug fixing commits.

Methods: We use a crowdsourcing approach for manual labeling to validate, for each line in bug fixing commits, which changes contribute to the bug fix. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.

Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to production code files, this ratio increases to between 66% and 87%. We find that about 11% of lines are hard to label, leading to active disagreement between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of the data is noisy without manual untangling, depending on the use case.

Conclusion: Tangled commits are highly prevalent in bug fixes and can introduce a large amount of noise into the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy, until proven otherwise.
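The consensus rule described in the methods (four labels per line, consensus when at least three agree) can be sketched as a small function. This is an illustrative sketch, not code from the paper's replication package; the function and label names are hypothetical.

```python
from collections import Counter

def consensus_label(labels):
    """Return the consensus label for one changed line, or None.

    Per the protocol in the abstract, each line is labeled by four
    participants; a label counts as consensus if at least three agree.
    (Names here are illustrative, not taken from the paper.)
    """
    assert len(labels) == 4, "each line is labeled by four participants"
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else None

# Three of four participants agree -> consensus on "bugfix"
print(consensus_label(["bugfix", "bugfix", "bugfix", "refactoring"]))
# A 2-2 split -> no consensus, i.e. an active disagreement
print(consensus_label(["bugfix", "bugfix", "test", "test"]))
```

Lines without consensus (about 11% in the study) are exactly those where this function returns no label.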

Funder

Technische Universität Clausthal

Publisher

Springer Science and Business Media LLC

Subject

Software

Cited by 14 articles.

1. A code change-oriented approach to just-in-time defect prediction with multiple input semantic fusion;Expert Systems;2024-08-27

2. On Refining the SZZ Algorithm with Bug Discussion Data;Empirical Software Engineering;2024-07-24

3. JIT-Smart: A Multi-task Learning Framework for Just-in-Time Defect Prediction and Localization;Proceedings of the ACM on Software Engineering;2024-07-12

4. An Exploratory Study of Programmers' Analogical Reasoning and Software History Usage During Code Re-Purposing;Proceedings of the 2024 IEEE/ACM 17th International Conference on Cooperative and Human Aspects of Software Engineering;2024-04-14

5. Delving into Parameter-Efficient Fine-Tuning in Code Change Learning: An Empirical Study;2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER);2024-03-12
