Using Area Under the Precision Recall Curve to Assess the Effect of Random Undersampling in the Classification of Imbalanced Medicare Big Data
Published: 2023-12-29
ISSN: 0218-5393
Container-title: International Journal of Reliability, Quality and Safety Engineering
Language: en
Short-container-title: Int. J. Rel. Qual. Saf. Eng.
Authors:
Hancock III, John T. (1)
Khoshgoftaar, Taghi M. (1)
Johnson, Justin M. (1)
Affiliation:
1. Department of Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA
Abstract
In this paper, we investigate the impact of Random Undersampling (RUS) on a supervised Machine Learning task involving highly imbalanced Big Data. We present the results of experiments in Medicare Fraud detection. To the best of our knowledge, these experiments are conducted with the largest insurance claims datasets ever used for Medicare Fraud detection. We obtain two datasets from two Big Data repositories provided by the United States government's Centers for Medicare and Medicaid Services. The larger of the two datasets contains nearly 174 million instances, with a minority-to-majority class ratio of approximately 0.0039. Our contribution is to show that RUS has a detrimental effect on a Medicare Fraud detection task when performed on large-scale, imbalanced data. The effect of RUS is apparent in the Area Under the Precision Recall Curve (AUPRC) scores recorded from experimental outcomes. We use four popular, open-source classifiers in our experiments to confirm the negative impact of RUS on their AUPRC scores.
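The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `random_undersample` and `average_precision`, the label convention (1 = minority/fraud, 0 = majority), and the use of step-wise average precision as the AUPRC estimate are all assumptions made for illustration. The abstract does not state how the curve area was computed or which sampling ratios were used.

```python
import random


def random_undersample(labels, ratio, seed=0):
    """Hypothetical helper: return the indices kept after RUS.

    All minority (label 1) instances are kept; majority (label 0)
    instances are discarded at random until the minority-to-majority
    ratio is approximately `ratio` (e.g. 1.0 for a 1:1 balance).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more majority instances than exist.
    n_keep = min(len(majority), round(len(minority) / ratio))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


def average_precision(labels, scores):
    """Hypothetical helper: step-wise estimate of the area under the PR curve.

    Instances are ranked by descending score, and the precision observed
    at each positive instance is averaged. Tied scores are not treated
    specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("AUPRC is undefined without positive instances")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Toy data only: 4 positives among 1000 instances (not the Medicare data).
toy_labels = [1] * 4 + [0] * 996
kept = random_undersample(toy_labels, ratio=1.0)
print(len(kept))  # 4 minority + 4 sampled majority instances
```

In a setup of this kind, undersampling would normally be applied to the training split only, with AUPRC computed on test data that keeps its original class distribution. Whether the paper follows exactly that protocol is not stated in the abstract.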
Publisher
World Scientific Pub Co Pte Ltd
Subject
Electrical and Electronic Engineering; Industrial and Manufacturing Engineering; Energy Engineering and Power Technology; Aerospace Engineering; Safety, Risk, Reliability and Quality; Nuclear Energy and Engineering; General Computer Science
Cited by
1 article.