Experience

Author:

Bosu Michael F. (1), Macdonell Stephen G. (2)

Affiliation:

1. University of Otago and Waikato Institute of Technology, Hamilton, New Zealand

2. University of Otago, Auckland, New Zealand

Abstract

Data are a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used both to improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.

Funder

University of Otago Postgraduate Publishing Bursary

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

Cited by 21 articles.

1. Agile effort estimation in Colombia: An assessment and opportunities for improvement;Science of Computer Programming;2024-09

2. Extended Association Rule Mining and Its Application to Software Engineering Data Sets;International Journal of Software Engineering and Knowledge Engineering;2024-08-30

3. A random forest model for early-stage software effort estimation for the SEERA dataset;Information and Software Technology;2024-05

4. Review and Empirical Analysis of Machine Learning-Based Software Effort Estimation;IEEE Access;2024

5. An Integrative Theoretical Framework for Responsible Artificial Intelligence;International Journal of Digital Strategy, Governance, and Business Transformation;2023-12-15
