A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models

Published: 2022-11-01 Issue: 11 Volume: 11 Page: 607
ISSN: 2075-1680
Container-title: Axioms
Language: en
Short-container-title: Axioms

Author:

Zheng Ming, Wang Fei, Hu Xiaowen, Miao Yuhao, Cao Huo, Tang Mingjing

Abstract

Machine learning models may fail to learn from and predict on imbalanced data effectively, a known problem in machine learning and data mining. This study proposes a method for analyzing the performance impact of imbalanced binary data on machine learning models. It systematically analyzes (1) the relationship between the performance of machine learning models and the imbalance rate (IR), and (2) the performance stability of machine learning models on imbalanced binary data. In the proposed method, imbalanced data augmentation algorithms are first designed to obtain imbalanced datasets with gradually varying IR. Then, to obtain more objective classification results, the evaluation metric AFG, the arithmetic mean of the area under the receiver operating characteristic curve (AUC), F-measure and G-mean, is used to evaluate the classification performance of machine learning models. Finally, a performance stability evaluation method based on AFG and the coefficient of variation (CV) is proposed. Experiments with eight widely used machine learning models on 48 different imbalanced datasets demonstrate that classification performance decreases as IR increases on the same imbalanced data. Meanwhile, the classification performance of LR, DT and SVC is unstable, while GNB, BNB, KNN, RF and GBDT are relatively stable and not susceptible to imbalanced data. In particular, BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests also confirm this result. The SMOTE method is used for oversampling-based imbalanced data augmentation, and whether other oversampling methods yield consistent results needs further research. In future work, imbalanced data augmentation algorithms based on undersampling and hybrid sampling should be used to analyze the performance impact of imbalanced binary data on machine learning models.
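The quantities named in the abstract (IR, AFG and CV) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give exact formulas, so the sketch assumes IR is the majority-to-minority class count ratio, F-measure is F1 on the positive class, G-mean is the geometric mean of sensitivity and specificity, and CV is the population standard deviation divided by the mean. All function names and the 0.5 decision threshold are hypothetical.

```python
import math
from statistics import mean, pstdev


def imbalance_rate(labels):
    """IR, assumed here to be majority-class count / minority-class count."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return max(pos, neg) / min(pos, neg)


def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure_and_g_mean(labels, preds):
    """F1 on the positive class and G-mean = sqrt(sensitivity * specificity)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, math.sqrt(recall * specificity)


def afg(labels, scores, threshold=0.5):
    """AFG: arithmetic mean of AUC, F-measure and G-mean (threshold is an assumption)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    f1, g = f_measure_and_g_mean(labels, preds)
    return (auc_score(labels, scores) + f1 + g) / 3


def coefficient_of_variation(values):
    """CV, assumed here to be population standard deviation / mean."""
    return pstdev(values) / mean(values)


# Toy data: 2 positives and 6 negatives, with hypothetical classifier scores.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
print("IR:", imbalance_rate(labels))
print("AFG:", afg(labels, scores))
```

Under these assumptions, a stability check in the spirit of the abstract would compute AFG for one model at each IR level and pass the resulting list to `coefficient_of_variation`; a smaller CV would indicate performance that varies less as the imbalance grows.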

Funder

Major Project of Natural Science Research in Colleges and Universities of Anhui Province

2021 cultivation project of Anhui Normal University

Wuhu Science and Technology Bureau Project

Publisher

MDPI AG

Subject

Geometry and Topology,Logic,Mathematical Physics,Algebra and Number Theory,Analysis

Link

https://www.mdpi.com/2075-1680/11/11/607/pdf

Reference: 36 articles.

1. Multiset feature learning for highly imbalanced data classification;Jing;IEEE Trans. Pattern Anal. Mach. Intell.,2020

2. Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification;Zheng;Inf. Sci.,2020

3. UFFDFR: Undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection for imbalanced data classification;Zheng;Inf. Sci.,2021

4. Exploring ensemble oversampling method for imbalanced keyword extraction learning in policy text based on three-way decisions and SMOTE;Liang;Expert Syst. Appl.,2022

5. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data;Kim;Neural Netw.,2020

Cited by 11 articles.

1. R-WDLS: An efficient security region oversampling technique based on data distribution;Applied Soft Computing;2024-03

2. Gene Expression-Based Cancer Classification for Handling the Class Imbalance Problem and Curse of Dimensionality;International Journal of Molecular Sciences;2024-02-09

3. Handling Imbalanced Datasets in Software Refactoring Prediction;Communications in Computer and Information Science;2024

4. Class imbalance and its impact on predictive models for binary classification of disease: a comparative analysis;Artificial Intelligence and Image Processing in Medical Imaging;2024

5. Coronary Heart Disease Prediction Through Machine Learning Using Non-Laboratory Risk Factors;2023 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT);2023-11-23