Cross‐version defect prediction using threshold‐based active learning

Published:2023-04-02
ISSN:2047-7473
Container-title:Journal of Software: Evolution and Process
language:en
Short-container-title:J Software Evolu Process

Author:

Mei Yuanqing¹, Liu Xutong¹, Lu Zeyu¹, Yang Yibiao¹, Liu Huihui¹, Zhou Yuming¹

Affiliation:

1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Abstract

Because defects in software modules (e.g., classes) can lead to product failure and financial loss, software defect prediction helps us better understand and control software quality. Software development is a dynamic, evolutionary process, so data distributions (e.g., defect characteristics) may vary from version to version, which makes effective cross‐version defect prediction (CVDP) difficult to achieve. In this paper, we investigate whether a defect prediction method based on threshold‐based active learning (TAL) can tackle the problem of differing data distributions between successive versions. Our TAL method has two stages. At the active learning stage, a committee of investigated metrics is constructed to vote on the unlabeled modules of the current version. We select the unlabeled module with the median voting score and pass it to domain experts, who test and label it. Then, we merge the newly labeled module and the remaining modules of the current version, which carry pseudo‐labels, into the labeled modules of the prior version to form enhanced training data. From this training data, we derive the metric thresholds used for the next iteration. At the defect prediction stage, the iterations stop when a predefined threshold is reached, and we then use the cutoff threshold of voting scores, that is, 50%, to predict the defect‐proneness of the remaining unlabeled modules. We evaluate the TAL method on 31 versions of 10 projects with three prevalent performance indicators. The results show that TAL outperforms the baseline methods, including three variant methods, two common supervised methods, and the state‐of‐the‐art method Hybrid Active Learning and Kernel PCA (HALKP). The results indicate that TAL can effectively address the differing data distributions between successive versions. Furthermore, to keep the cost of extensive testing low in practice, selecting 5% of candidate modules from the current version is sufficient for TAL to achieve good defect prediction performance.
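
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how metric thresholds are derived or what the stopping criterion is, so the sketch assumes a hypothetical threshold rule (the midpoint between the mean metric value of defective and clean modules) and a hypothetical labeling budget as the stopping condition. The names `derive_thresholds`, `voting_score`, `tal_sketch`, `oracle`, and `budget` are illustrative only, and treating a score of exactly 50% as defect‐prone is also an assumption.

```python
from statistics import mean


def derive_thresholds(rows, labels):
    """Assumed rule (not from the paper): for each metric, use the midpoint
    between the mean over defective modules and the mean over clean modules.
    Requires at least one module of each class in the training data."""
    thresholds = []
    for j in range(len(rows[0])):
        defective = [r[j] for r, y in zip(rows, labels) if y == 1]
        clean = [r[j] for r, y in zip(rows, labels) if y == 0]
        thresholds.append((mean(defective) + mean(clean)) / 2)
    return thresholds


def voting_score(row, thresholds):
    """Fraction of committee metrics that vote 'defect-prone', i.e. whose
    value exceeds the metric's threshold."""
    votes = sum(1 for value, t in zip(row, thresholds) if value > t)
    return votes / len(thresholds)


def tal_sketch(prior_rows, prior_labels, current_rows, oracle, budget):
    """Active learning stage followed by prediction with a 50% cutoff.

    oracle(i) stands in for the domain expert labeling module i;
    budget is a hypothetical stopping criterion (number of expert labels).
    """
    expert = {}  # module index -> label given by the oracle
    thresholds = derive_thresholds(prior_rows, prior_labels)

    while len(expert) < budget:
        unlabeled = [i for i in range(len(current_rows)) if i not in expert]
        if not unlabeled:
            break
        scores = {i: voting_score(current_rows[i], thresholds) for i in unlabeled}

        # Query the module whose voting score is the median of the pool.
        ranked = sorted(unlabeled, key=lambda i: scores[i])
        pick = ranked[len(ranked) // 2]
        expert[pick] = oracle(pick)

        # Enhanced training data: prior version + expert labels + pseudo-labels.
        rows = list(prior_rows) + list(current_rows)
        labels = list(prior_labels)
        for i in range(len(current_rows)):
            if i in expert:
                labels.append(expert[i])
            else:
                labels.append(int(scores[i] >= 0.5))  # pseudo-label
        thresholds = derive_thresholds(rows, labels)

    # Prediction stage: 50% cutoff on the voting score for unlabeled modules.
    return {
        i: expert[i] if i in expert
        else int(voting_score(current_rows[i], thresholds) >= 0.5)
        for i in range(len(current_rows))
    }
```

In this sketch, each row is a list of metric values for one module and labels are 1 (defective) or 0 (clean). A budget of roughly 5% of the current version's modules would correspond to the labeling effort the abstract reports as sufficient.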

Funder

National Natural Science Foundation of China

Publisher

Wiley

Subject

Software

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/smr.2563

Reference: 70 articles.

1. Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction

2. Deriving object‐oriented metric thresholds: research problems, progress, and challenges; Mei YQ; Ruan Jian Xue Bao/J Softw (in Chinese), 2022