Affiliation:
1. University of Ghana, Legon, Accra-Ghana
2. Accra Technical University, Accra-Ghana
3. University of Ghana, Accra-Ghana
Abstract
The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification, imbalanced learning refers to learning from a dataset that is heavily skewed toward the negative class. This skew causes classification algorithms to perform poorly when predicting the positive class for new examples. Data resampling, which manipulates the training data before standard classification techniques are applied, is among the most commonly used ways of dealing with the class imbalance problem. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on imbalanced data. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), combines a cluster undersampling technique, which undersamples the majority instances, with an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, which oversamples the minority instances, to address the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested on 11 datasets with varying degrees of imbalance from the National Aeronautics and Space Administration Metric Data Program data repository and the University of California Irvine Machine Learning data repository. Results were compared using classification algorithms such as K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, for the same datasets, the HCBST performed better, with average scores of 0.73, 0.67, and 0.35 for area under the curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used in this study.
The HCBST has the potential to improve classifier performance on class-imbalanced data, which, by extension, would benefit the various applications that rely on such classifiers for a solution.
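The abstract does not give the details of HCBST, so the following is only a minimal sketch of the general idea it describes: undersample the majority class cluster by cluster, and oversample the minority class with convex combinations of nearby minority points. Everything specific here is an assumption made for illustration, not the authors' algorithm: plain k-means stands in for the clustering step, random normalised weights stand in for the convex-combination rule, the balance target is half the total sample count, and all function names (`kmeans`, `cluster_undersample`, `convex_oversample`, `hybrid_resample`) are hypothetical.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    k = min(k, len(points))
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [_mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def cluster_undersample(majority, n_keep, k, rng):
    """Keep roughly n_keep majority points, drawn from each cluster in proportion to its size."""
    kept = []
    for cluster in kmeans(majority, k, rng):
        quota = max(1, round(n_keep * len(cluster) / len(majority)))
        kept.extend(rng.sample(cluster, min(quota, len(cluster))))
    return kept


def convex_oversample(minority, n_new, n_neighbors, rng):
    """Build synthetic points as convex combinations of a base point and its nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        group = [base] + sorted(others, key=lambda p: _dist2(base, p))[:n_neighbors]
        # Assumed weighting: positive random weights normalised to sum to 1.
        weights = [rng.uniform(0.01, 1.0) for _ in group]
        total = sum(weights)
        synthetic.append(
            [sum(w * p[d] for w, p in zip(weights, group)) / total for d in range(len(base))]
        )
    return synthetic


def hybrid_resample(majority, minority, k=3, n_neighbors=3, seed=0):
    """Shrink the majority class and grow the minority class toward a common target size."""
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2  # assumed balance point
    new_majority = cluster_undersample(majority, target, k, rng)
    n_new = max(0, target - len(minority))
    new_minority = minority + convex_oversample(minority, n_new, n_neighbors, rng)
    return new_majority, new_minority
```

Because every synthetic point is a convex combination of existing minority points, it always lies inside their convex hull, which is the property that distinguishes this style of oversampling from adding unconstrained noise. The resampled sets would then be passed to any standard classifier and scored with measures such as those named in the abstract.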
Publisher
Association for Computing Machinery (ACM)