Optimal selection of resampling methods for imbalanced data with high complexity

Published:2023-07-27 Issue:7 Volume:18 Page:e0288540
ISSN:1932-6203
Container-title:PLOS ONE
Language:en
Short-container-title:PLoS ONE

Author:

Kim Annie, Jung Inkyung

Abstract

Class imbalance is a major problem in classification, wherein the decision boundary is easily biased toward the majority class. Resampling, a data-level approach, is one possible solution to this problem. However, several studies have shown that resampling methods can degrade classification performance. This is attributed to the overgeneralization problem, which occurs when samples produced by an oversampling technique, which should lie in the minority-class domain, are instead introduced into the majority-class domain. This study shows that the overgeneralization problem is aggravated in complex data settings and introduces two alternative approaches to mitigate it. The first approach involves incorporating a filtering method into oversampling. The second approach is to apply undersampling. The main objective of this study is to provide guidance on selecting optimal resampling methods for imbalanced and complex datasets to improve classification performance. Simulation studies and real data analyses were performed to compare the resampling results in various scenarios with different complexities, imbalances, and sample sizes. In the case of noncomplex datasets, undersampling was found to be optimal. However, in the case of complex datasets, applying a filtering method to delete misallocated examples was optimal. In conclusion, this study can aid researchers in selecting the optimal method for resampling complex datasets.
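The two approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify which oversampling, filtering, or undersampling algorithms were compared, so the code assumes a simplified SMOTE-like interpolation between random minority pairs (real SMOTE interpolates toward k nearest neighbors), a nearest-neighbor filter that deletes synthetic points landing closer to the majority class, and plain random undersampling. All function names and the toy data are hypothetical.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Drop majority examples at random until both classes have equal size."""
    return rng.sample(majority, len(minority)), list(minority)


def interpolate_oversample(minority, n_new, rng):
    """Simplified SMOTE-like step: place new points on segments between
    randomly chosen pairs of minority examples (assumption: no k-NN search)."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


def filter_synthetic(synthetic, majority, minority):
    """Keep a synthetic point only if its nearest original example belongs
    to the minority class; otherwise treat it as misallocated and delete it."""
    kept = []
    for s in synthetic:
        d_min = min(math.dist(s, m) for m in minority)
        d_maj = min(math.dist(s, m) for m in majority)
        if d_min <= d_maj:
            kept.append(s)
    return kept


rng = random.Random(0)

# Hypothetical toy data: the minority class forms two small clusters on
# either side of the majority class, so interpolating between clusters
# can push synthetic points into the majority-class domain.
majority = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(100)]
minority = (
    [(rng.gauss(-3.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(5)]
    + [(rng.gauss(3.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(5)]
)

# Approach 2 in the abstract: undersampling.
under_major, under_minor = random_undersample(majority, minority, rng)

# Approach 1 in the abstract: oversampling followed by filtering.
synthetic = interpolate_oversample(minority, len(majority) - len(minority), rng)
kept = filter_synthetic(synthetic, majority, minority)

print("undersampled majority size:", len(under_major))
print("synthetic points generated:", len(synthetic))
print("synthetic points kept after filtering:", len(kept))
```

In this toy setting, synthetic points interpolated across the two minority clusters can fall inside the majority region, which is the kind of overgeneralization the abstract describes; the filter removes such points, whereas undersampling avoids generating them in the first place.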

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Reference: 39 articles.

1. Classification of imbalanced data: A review;Y Sun;International journal of pattern recognition and artificial intelligence,2009

2. SMOTE: synthetic minority over-sampling technique;NV Chawla;Journal of artificial intelligence research,2002

3. Using unsupervised learning to guide resampling in imbalanced data sets;A Nickerson;In International Workshop on Artificial Intelligence and Statistics,2001

Cited by 2 articles.

1. Machine learning applied to the prediction of relapse, hospitalization, and suicide in bipolar disorder using neuroimaging and clinical data: A systematic review;Journal of Affective Disorders;2024-09

2. Identifying Key Learning Algorithm Parameter of Forward Feature Selection to Integrate with Ensemble Learning for Customer Churn Prediction;VFAST Transactions on Software Engineering;2024-06-11