Predicting defects in imbalanced data using resampling methods: an empirical investigation

Published: 2022-04-29 Volume: 8 Page: e573
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Malhotra Ruchika¹, Jain Juhi²

Affiliation:

1. Department of Software Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India

2. Department of Computer Science and Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India

Abstract

The development of correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics from many defect-related open-source data sets exhibit the class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction. In addition, a large number of software metrics degrades model performance. This study aims at (1) identifying useful software metrics using correlation feature selection, (2) an extensive comparative analysis of 10 resampling methods to generate effective machine learning models for imbalanced data, (3) the inclusion of stable performance evaluators (AUC, GMean, and Balance), and (4) the integration of statistical validation of results. The impact of 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is analyzed using AUC, GMean, Balance, and sensitivity. Statistical results advocate the use of resampling methods to improve SDP. Models developed with random oversampling show the best predictive capability. The study provides a guideline for identifying metrics that are influential for SDP. Oversampling methods perform better than undersampling methods.
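To make the abstract's terms concrete, the following is a minimal Python sketch of random oversampling together with the GMean and Balance measures. It is an illustration only, not the authors' implementation: the function names are hypothetical, and the formulas are the definitions commonly used in the defect prediction literature (GMean as the geometric mean of sensitivity and specificity; Balance as one minus the normalized Euclidean distance from the ideal point with false-positive rate 0 and sensitivity 1), which the abstract itself does not spell out.

```python
import math
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary prediction."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def g_mean(sens, spec):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sens * spec)


def balance(sens, spec):
    """One minus the normalized distance from the ideal point
    (false-positive rate 0, sensitivity 1); assumed definition."""
    pf = 1.0 - spec  # false-positive rate
    return 1.0 - math.sqrt(pf ** 2 + (1.0 - sens) ** 2) / math.sqrt(2.0)


if __name__ == "__main__":
    # Toy data: 8 non-defective modules (0) and 2 defective modules (1).
    rows = [[i] for i in range(10)]
    labels = [0] * 8 + [1] * 2
    _, new_labels = random_oversample(rows, labels)
    print(Counter(new_labels))
```

In practice, resampling of this kind is applied to the training data only, so that the test data keeps its original class distribution.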

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-573.pdf

Reference: 76 articles.

1. Is “Better Data” better than “Better Data Miners”?;Agrawal,2018

2. Instance-based learning algorithms;Aha;Machine Learning,1991

3. Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework;Alcalá-Fdez;Journal of Multiple-Valued Logic & Soft Computing,2011

4. A feature dependent naive Bayes approach and its application to the software defect prediction problem;Arar;Applied Soft Computing,2017

5. Performance analysis of feature selection methods in software defect prediction: a search method approach;Balogun;Applied Sciences,2019

Cited by 4 articles.

1. Handling class overlap and imbalance using overlap driven under-sampling with balanced random forest in software defect prediction;Innovations in Systems and Software Engineering;2024-06-18

2. Feature selection based on neighborhood rough sets and Gini index;PeerJ Computer Science;2023-12-12

3. An empirical evaluation of defect prediction approaches in within-project and cross-project context;Software Quality Journal;2023-03-04

4. Software Sentiment Analysis using Deep-learning Approach with Word-Embedding Techniques;Annals of Computer Science and Information Systems;2022-09-26