Noise-Free Sampling with Majority for Imbalanced Classification Problem

Author:

Firdausanti Neni Alya 1, Mendonça Israel 1, Aritsugi Masayoshi 1

Affiliation:

1. Kumamoto University

Abstract

Class imbalance is widely accepted as a significant factor that negatively impacts a machine learning classifier's performance. One technique to mitigate this problem is to balance the data distribution with sampling-based approaches, in which synthetic data are generated from the probability distribution of the classes. However, this process is sensitive to the presence of noise in the data, which blurs the boundaries between the majority and minority classes and shifts the algorithm's decision boundary away from the ideal outcome. In this work, we propose a framework with two primary objectives: first, to address class-distribution imbalance by synthetically increasing the data of the minority class; and second, to devise an efficient noise-reduction technique that improves the class-balancing algorithm. The proposed framework focuses on removing noisy elements from the majority class and, by doing so, provides more accurate information to the subsequent synthetic data generator. Experimental results show that our framework improves the prediction accuracy of eight classifiers by 7.78% up to 67.45% across the eleven datasets tested.
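The abstract does not spell out the authors' exact algorithm, but the two-stage idea it describes (first filter noisy majority-class points, then oversample the minority class) can be illustrated with a minimal NumPy sketch. The ENN-style neighbour-disagreement filter and the SMOTE-style interpolation below are stand-in techniques chosen for illustration, not the paper's method, and all function names are hypothetical:

```python
import numpy as np

def knn_noise_filter(X, y, majority_label, k=5):
    """Drop majority samples whose k nearest neighbours are mostly
    minority-class points (a simple ENN-style disagreement rule)."""
    keep = np.ones(len(X), dtype=bool)
    for i in np.where(y == majority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]            # skip the point itself
        if np.mean(y[nn] != majority_label) > 0.5:
            keep[i] = False                    # surrounded by minority: treat as noise
    return X[keep], y[keep]

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced data: 40 majority points, 8 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(3.0, 0.5, size=(8, 2))])
y = np.array([0] * 40 + [1] * 8)

# Stage 1: clean the majority class; Stage 2: synthesise minority samples.
X_f, y_f = knn_noise_filter(X, y, majority_label=0, k=5)
X_syn = smote_like_oversample(X[y == 1], n_new=32, k=3, rng=1)
print(X_syn.shape, int(np.sum(y_f == 1)))
```

Because the filter only inspects majority-class points, the minority class is never shrunk, and every synthetic sample lies on a segment between two real minority points, so it stays inside the minority class's bounding box.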

Publisher

Research Square Platform LLC
