Author:
Leng Qiangkui, Guo Jiamei, Tao Jiaqing, Meng Xiangfu, Wang Changzhong
Abstract
Mitigating the impact of class-imbalanced datasets on classifiers poses a challenge to the machine learning community. Conventional classifiers do not perform well because they are habitually biased toward the majority class. Among existing solutions, the synthetic minority oversampling technique (SMOTE) has shown great potential, as it aims to improve the dataset rather than the classifier. However, SMOTE still needs improvement because it oversamples every minority instance equally. Based on the consensus that instances far from the borderline contribute less to classification, this paper proposes a refined method for oversampling borderline minority instances (OBMI) using a two-stage Tomek link-finding procedure. In the oversampling stage, pairs of between-class instances that are nearest to each other are first found to form Tomek links. The minority instances in these Tomek links are then extracted as base instances. Finally, new minority instances are generated, each linearly interpolated between a base instance and one of its minority neighbors. To address the class overlap caused by oversampling, Tomek links are employed again in the cleaning stage to remove borderline instances from both classes. OBMI is compared with ten baseline methods on 17 benchmark datasets. The results show that it performs better on most of the selected datasets in terms of F1-score and G-mean. Statistical analysis also indicates its superior Friedman ranking.
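The abstract outlines a two-stage procedure: Tomek-link-guided oversampling followed by Tomek-link cleaning. The sketch below illustrates that procedure in Python; it is a minimal interpretation of the description above, not the authors' implementation. The function names, the neighborhood size k, and the use of scikit-learn's NearestNeighbors are assumptions made for illustration.

```python
# Minimal sketch of the two-stage procedure described in the abstract:
# (1) oversample minority instances that sit on Tomek links,
# (2) clean remaining Tomek links from both classes.
# Names, k, and the NearestNeighbors usage are illustrative assumptions.

import numpy as np
from sklearn.neighbors import NearestNeighbors


def find_minority_tomek_links(X, y, minority_label):
    """Return indices of minority instances that belong to Tomek links.

    Two instances form a Tomek link if they have different labels and are
    each other's nearest neighbor.
    """
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]  # skip self
    links = [
        i for i, j in enumerate(nearest)
        if y[i] != y[j] and nearest[j] == i and y[i] == minority_label
    ]
    return np.array(sorted(set(links)))


def oversample_borderline(X, y, minority_label, n_new, k=5, seed=None):
    """Interpolate new minority instances between Tomek-link base instances
    and their minority-class neighbors (oversampling stage)."""
    rng = np.random.default_rng(seed)
    base_idx = find_minority_tomek_links(X, y, minority_label)
    X_min = X[y == minority_label]
    nn_min = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    synthetic = []
    for _ in range(n_new):
        b = X[rng.choice(base_idx)]
        # k minority neighbors of the base instance (index 0 is the base itself)
        neigh = nn_min.kneighbors(b.reshape(1, -1), return_distance=False)[0][1:]
        m = X_min[rng.choice(neigh)]
        synthetic.append(b + rng.random() * (m - b))  # linear interpolation
    X_new = np.vstack([X, np.asarray(synthetic)])
    y_new = np.concatenate([y, np.full(n_new, minority_label)])
    return X_new, y_new


def clean_tomek_links(X, y):
    """Cleaning stage: remove both endpoints of every remaining Tomek link."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    drop = set()
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:
            drop.update((i, j))
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]
```

Under these assumptions, a caller would run `oversample_borderline` until the class sizes are balanced and then pass the result through `clean_tomek_links`; the paper's evaluation details (datasets, k, amount of oversampling) are not reproduced here.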
Funder
National Natural Science Foundation of China
PhD Startup Foundation of Liaoning Technical University
Publisher
Springer Science and Business Media LLC