Assessments of Feature Selection Techniques with Respect to Data Sampling for Highly Imbalanced Software Measurement Data

Published:2015-04 Issue:02 Volume:22 Page:1550010
ISSN:0218-5393
Container-title:International Journal of Reliability, Quality and Safety Engineering
language:en
Short-container-title:Int. J. Rel. Qual. Saf. Eng.

Author:

Gao Kehan¹, Khoshgoftaar Taghi M.²

Affiliation:

1. Department of Mathematics and Computer Science, Eastern Connecticut State University, 83 Windham Street, Willimantic, Connecticut 06226, USA

2. Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, Florida 33431, USA

Abstract

In software defect prediction, a classification model is first built from software metrics and fault data gathered from a past software development project. The model is then applied to data from a similar project, or from a new release of the same project, to classify program modules as either fault-prone (fp) or not-fault-prone (nfp). The benefit of such a model is that it facilitates the optimal use of limited financial and human resources for software testing and inspection. The predictive power of a classification model constructed from a given data set is affected by many factors. In this paper, we focus on two problems that often arise in software measurement data: high dimensionality and class imbalance, i.e., unequal numbers of the two types of modules (e.g., many more nfp modules than fp modules in a data set). These problems directly lengthen learning time and degrade the predictive performance of classification models. We consider using data sampling followed by feature selection (FS) to deal with them. Six data sampling strategies (three sampling techniques, each combined with two post-sampling proportion ratios) and six commonly used feature ranking approaches are employed in this study. We evaluate the FS techniques in two ways: (1) a general method, i.e., assessing classification performance after the training data is modified, and (2) studying the stability of an FS method, specifically to understand the effect of data sampling techniques on the stability of FS when using the sampled data. The experiments were performed on nine data sets from a real-world software project. The results demonstrate that the FS techniques that most enhance the models' classification performance do not also show the best stability, and vice versa. In addition, classification performance is affected more by the sampling techniques themselves than by the post-sampling proportions, whereas the opposite holds for stability.
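
The pipeline described in the abstract (sample the training data, rank features on the sampled data, then check how stable the selected subset is) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not name the sampling techniques, the rankers, or the stability measure, so random undersampling, a signal-to-noise score, and Jaccard overlap with the subset chosen from the unsampled data are stand-ins. The function names, the synthetic data, and the parameter `minority_share` are hypothetical.

```python
import random
from statistics import mean, pstdev


def make_data(seed=1):
    """Synthetic imbalanced data (assumed): 200 nfp rows (label 0), 20 fp rows (label 1).
    Only features 0 and 1 differ between the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((0, 200), (1, 20)):
        for _ in range(count):
            X.append([rng.gauss(2.0 * label if j < 2 else 0.0, 1.0) for j in range(6)])
            y.append(label)
    return X, y


def undersample(X, y, minority_share, rng):
    """Randomly drop majority-class rows until the minority class (label 1)
    makes up roughly `minority_share` of the data."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    # Number of majority rows that yields the requested post-sampling proportion.
    keep = round(len(minority) * (1 - minority_share) / minority_share)
    kept = minority + rng.sample(majority, min(keep, len(majority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def rank_features(X, y, k):
    """Return the indices of the k features with the highest signal-to-noise
    score (a stand-in filter ranker, not one taken from the paper)."""
    scores = []
    for j in range(len(X[0])):
        fp = [row[j] for row, label in zip(X, y) if label == 1]
        nfp = [row[j] for row, label in zip(X, y) if label == 0]
        spread = (pstdev(fp) + pstdev(nfp)) or 1e-12  # guard against zero spread
        scores.append(abs(mean(fp) - mean(nfp)) / spread)
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return set(order[:k])


def stability(X, y, k, minority_share, runs=20, seed=0):
    """Mean Jaccard overlap between the subset chosen from the original data
    and the subsets chosen from repeatedly sampled data (assumed measure)."""
    rng = random.Random(seed)
    reference = rank_features(X, y, k)
    overlaps = []
    for _ in range(runs):
        Xs, ys = undersample(X, y, minority_share, rng)
        chosen = rank_features(Xs, ys, k)
        overlaps.append(len(chosen & reference) / len(chosen | reference))
    return mean(overlaps)


if __name__ == "__main__":
    X, y = make_data()
    for share in (0.35, 0.50):  # two illustrative post-sampling proportions
        print(share, stability(X, y, k=2, minority_share=share))
```

A value near 1 means the ranker picks almost the same features before and after sampling. Assessing classification performance, the other evaluation mentioned in the abstract, would additionally require training a learner on the selected features and is left out of this sketch.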

Publisher

World Scientific Pub Co Pte Lt

Subject

Electrical and Electronic Engineering; Industrial and Manufacturing Engineering; Energy Engineering and Power Technology; Aerospace Engineering; Safety, Risk, Reliability and Quality; Nuclear Energy and Engineering; General Computer Science

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218539315500102

References: 26 articles.

1. Two-Stage Cost-Sensitive Learning for Software Defect Prediction

2. A General Software Defect-Proneness Prediction Framework

3. Predicting Bugs from History

4. Choosing software metrics for defect prediction: an investigation on feature selection techniques

Cited by 2 articles.

1. A systematic literature review on software defect prediction using artificial intelligence: Datasets, Data Validation Methods, Approaches, and Tools;Engineering Applications of Artificial Intelligence;2022-05

2. Empirical assessment of feature selection techniques in defect prediction models using web applications;Journal of Intelligent & Fuzzy Systems;2019-06-11