Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers

Published:2021-05-14 Issue:5 Volume:55 Page:771-787
ISSN:2514-9288
Container-title:Data Technologies and Applications
language:en
Short-container-title:DTA

Author:

Wang Zhenyuan, Tsai Chih-Fong, Lin Wei-Chao

Abstract

Purpose

Class imbalance, which arises in datasets from many problem domains, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies (the minority class) against normal data (the majority class), are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.

Design/methodology/approach

In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.

Findings

Experiments on 44 class imbalanced datasets, using three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, IFOREST and LOF), show that if the instance selection algorithm is carefully chosen, performing this step can improve the quality of the training data, so that one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.

Originality/value

The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario in which missing values exist in the training set used to train one-class classifiers. For this case, the two possible orderings of missing value imputation and instance selection are compared.
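The preprocessing pipeline described in the abstract (impute missing values, select instances from the majority class, then train a one-class model on the cleaned normal data) can be illustrated with a minimal standard-library Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: column-mean imputation stands in for the CART-based imputation, a k-nearest-neighbour isolation filter stands in for IB3, DROP3 and the GA, and a k-nearest-neighbour distance threshold stands in for OCSVM, IFOREST and LOF. The names `impute_mean`, `knn_score`, `select_instances` and `KnnOneClassDetector`, as well as the toy data, are hypothetical.

```python
import math
import random


def impute_mean(rows):
    """Replace None with the column mean (stand-in for CART-based imputation)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]


def knn_score(point, data, k):
    """Mean Euclidean distance from `point` to its k nearest neighbours in `data`."""
    dists = sorted(math.dist(point, other) for other in data if other is not point)
    return sum(dists[:k]) / k


def select_instances(data, k=3, keep_fraction=0.9):
    """Keep the least isolated rows; a simple noise filter, not IB3, DROP3 or a GA."""
    ranked = sorted(data, key=lambda p: knn_score(p, data, k))
    return ranked[: int(len(ranked) * keep_fraction)]


class KnnOneClassDetector:
    """Toy one-class model fitted on normal (majority class) rows only."""

    def __init__(self, k=3, quantile=0.95):
        self.k = k
        self.quantile = quantile

    def fit(self, normal_rows):
        self.normal_rows = normal_rows
        scores = sorted(knn_score(p, normal_rows, self.k) for p in normal_rows)
        # Decision boundary: a high quantile of the training scores.
        self.threshold = scores[int(self.quantile * (len(scores) - 1))]
        return self

    def is_anomaly(self, point):
        return knn_score(point, self.normal_rows, self.k) > self.threshold


random.seed(0)
# Toy majority class: a Gaussian cluster plus three far-away noisy rows.
train = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(80)]
train += [[8.0, 8.0], [9.0, -7.0], [-8.0, 9.0]]
train[5][0] = None   # simulate missing values
train[17][1] = None

imputed = impute_mean(train)                    # step 1: fill missing values
selected = select_instances(imputed)            # step 2: drop likely noise
detector = KnnOneClassDetector().fit(selected)  # step 3: train on cleaned data

print(len(imputed), len(selected))
print(detector.is_anomaly([10.0, 10.0]), detector.is_anomaly([0.0, 0.0]))
```

The sketch runs imputation before instance selection. The reverse order, which the study also compares, would require a selection step that tolerates incomplete rows, so it is not shown here.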

Publisher

Emerald

Subject

Library and Information Sciences,Information Systems

References: 39 articles.

1. Instance-based learning algorithms; Machine Learning, 1991

2. Framework for extreme imbalance classification: SWIM—sampling with the majority class; Knowledge and Information Systems, 2020

3. A survey of predictive modeling on imbalanced domains; ACM Computing Surveys, 2016

4. LOF: identifying density-based local outliers; SIGMOD Record, 2000

5. Using evolutionary algorithms as instance selection for data reduction: an experimental study; IEEE Transactions on Evolutionary Computation, 2003

Cited by 4 articles.

1. A hybridization of multiple imputation and one-class bagging ensemble approach for missing value and class imbalance problem; Evolving Systems, 2024-07-13

2. Majority re-sampling via sub-class clustering for imbalanced datasets; Journal of Experimental & Theoretical Artificial Intelligence, 2023-01-10

3. Unsupervised instance selection via conjectural hyperrectangles; Neural Computing and Applications, 2022-11-02

4. Sınıflar Arası Kenar Payını Genişletmek İçin Yeni Bir Örnek Seçim Algoritması [in Turkish; roughly "A New Instance Selection Algorithm for Widening the Margin Between Classes"]; Journal of Intelligent Systems: Theory and Applications, 2022-09-01