Learning from Highly Imbalanced Big Data with Label Noise

Published: 2023-08 Issue: 05 Volume: 32
ISSN: 0218-2130
Container-title: International Journal on Artificial Intelligence Tools
Language: en
Short-container-title: Int. J. Artif. Intell. Tools

Author:

Johnson Justin M.¹, Kennedy Robert K. L.¹, Khoshgoftaar Taghi M.¹

Affiliation:

1. College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, Florida 33431, United States

Abstract

This study explores the effects of class label noise on detecting fraud within three highly imbalanced healthcare fraud data sets containing millions of claims and minority class sizes as small as 0.1%. For each data set, 29 noise distributions are simulated by varying the level of class noise and the distribution of noise between the fraudulent and non-fraudulent classes. Four popular machine learning algorithms are evaluated on each noise distribution using six rounds of five-fold cross-validation. Performance is measured using the area under the precision-recall curve (AUPRC), true positive rate (TPR), and true negative rate (TNR) in order to understand the effect of the noise level, noise distribution, and their interactions. AUPRC results show that negative class noise, i.e. fraudulent samples incorrectly labeled as non-fraudulent, is the most detrimental to model performance. TPR and TNR results show that there are significant trade-offs in class-wise performance as noise transitions between the positive and the negative class. Finally, results reveal how overfitting negatively impacts the classification performance of some learners, and how simple regularization can be used to combat this overfitting and improve classification performance across all noise distributions.
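
The noise simulation described in the abstract (varying the noise level and how the noise is split between the two classes) and the class-wise metrics can be illustrated with a minimal sketch. It is not the authors' protocol: the function names, the `noise_level` and `positive_share` parameters, and the choice to scale the number of flipped labels by the minority class size are all assumptions made here for illustration.

```python
import random


def inject_label_noise(labels, noise_level, positive_share, seed=0):
    """Return a copy of binary labels (1 = fraud, 0 = non-fraud) with some flipped.

    Assumptions (hypothetical, not taken from the paper):
    - noise_level is a fraction of the positive (minority) class size
    - positive_share is the fraction of flips applied to true positives
      (1 -> 0); the remaining flips are applied to true negatives (0 -> 1)
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]

    n_flips = round(noise_level * len(pos_idx))
    n_pos_flips = min(round(positive_share * n_flips), len(pos_idx))
    n_neg_flips = min(n_flips - n_pos_flips, len(neg_idx))

    noisy = list(labels)
    for i in rng.sample(pos_idx, n_pos_flips):
        noisy[i] = 0  # fraudulent sample mislabeled as non-fraudulent
    for i in rng.sample(neg_idx, n_neg_flips):
        noisy[i] = 1  # non-fraudulent sample mislabeled as fraudulent
    return noisy


def tpr_tnr(y_true, y_pred):
    """Compute true positive rate and true negative rate for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    tpr = tp / n_pos if n_pos else 0.0
    tnr = tn / n_neg if n_neg else 0.0
    return tpr, tnr


# Toy data with a 1% minority class; the real data sets are far larger.
clean = [1] * 10 + [0] * 990
noisy = inject_label_noise(clean, noise_level=0.4, positive_share=0.5, seed=1)
print(sum(1 for a, b in zip(clean, noisy) if a != b), "labels flipped")
```

Sweeping `noise_level` and `positive_share` over a grid would produce a family of noise distributions in the spirit of the 29 described above; the specific grid the authors used is not given in this record.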

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence, General Medicine

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218213023600035

References: 49 articles.

1. A survey on addressing high-class imbalance in big data

2. Classification in the Presence of Label Noise: A Survey

3. An empirical study of the classification performance of learners on imbalanced and noisy software quality data

4. A study on rare fraud predictions with big Medicare claims fraud data

Cited by 1 article.

1. A novel reinforcement learning-based hybrid intrusion detection system on fog-to-cloud computing; The Journal of Supercomputing; 2024-08-20