Severely imbalanced Big Data challenges: investigating data sampling approaches

Published:2019-11-30 Issue:1 Volume:6 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Hasanin Tawfiq, Khoshgoftaar Taghi M., Leevy Joffrey L., Bauder Richard A.

Abstract

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
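As a rough illustration of two ideas named in the abstract, the sketch below shows Random Undersampling to a chosen class ratio and the Geometric Mean metric (the square root of true positive rate times true negative rate). It is not the authors' Spark implementation: the function names `random_undersample` and `geometric_mean`, the `minority_fraction` parameter, and the toy 10-versus-990 label set are all assumptions made for this example, and the 50:50 target is an illustrative choice rather than one of the paper's ratios.

```python
import math
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted indices keeping every positive (label 1) and a random
    subset of negatives (label 0), so that positives make up roughly
    `minority_fraction` of the kept samples. Hypothetical helper."""
    if not 0.0 < minority_fraction < 1.0:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Negatives needed so that len(pos) / (len(pos) + n_neg) ~= minority_fraction.
    n_neg = round(len(pos) * (1.0 - minority_fraction) / minority_fraction)
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_neg))


def geometric_mean(y_true, y_pred):
    """Geometric Mean of true positive rate and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Toy data (an assumption for illustration): 10 positives, 990 negatives.
labels = [1] * 10 + [0] * 990
kept = random_undersample(labels, minority_fraction=0.5)
kept_labels = [labels[i] for i in kept]

# A classifier that always predicts the majority class is 99% accurate on
# the toy data, yet its Geometric Mean is zero because it finds no positives.
always_negative = [0] * len(labels)
gm_majority_only = geometric_mean(labels, always_negative)
```

The undersampled set keeps all positives and discards most negatives, which is why the abstract links this approach to lower training cost. The Geometric Mean penalizes a model that ignores the minority class, which plain accuracy does not.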

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

http://link.springer.com/content/pdf/10.1186/s40537-019-0274-4.pdf

References: 49 articles.

1. Kaisler S, Armour F, Espinosa JA, Money W. Big Data: issues and challenges moving forward. In: 2013 46th Hawaii international conference on system sciences. IEEE; 2013. p. 995–1004.

2. Datamation: Big Data Trends. https://www.datamation.com/big-data/big-data-trends.html

3. Senthilkumar S, Rai BK, Meshram AA, Gunasekaran A, Chandrakumarmangalam S. Big Data in healthcare management: a review of literature. Am J Theory Appl Bus. 2018;4:57–69.

4. Bauder RA, Khoshgoftaar TM, Hasanin T. An empirical study on class rarity in Big Data. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2018. p. 785–90.

5. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in Big Data. J Big Data. 2018;5(1):42.

Cited by 77 articles.

1. A Novel Hybrid Resampling Approach to Address Class-Imbalanced Issues;SN Computer Science;2024-09-09

2. Predicting startup success using two bias-free machine learning: resolving data imbalance using generative adversarial networks;Journal of Big Data;2024-09-03

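3. Prediksi Klasifikasi Kecelakaan Lalu Lintas di Kota Surakarta dengan Menggunakan Metode Regresi Logistik Multinomial [Prediction of Traffic Accident Classification in Surakarta City Using the Multinomial Logistic Regression Method];Sustainable Civil Building Management and Engineering Journal;2024-08-18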

4. A synergistic fusion of shallow and deep generative model to enhance machine learning efficacy and classification performance in data-scarce environments;International Journal of Information Technology;2024-08-09

5. Enhancing Arabic Fake News Detection: Evaluating Data Balancing Techniques Across Multiple Machine Learning Models;Engineering, Technology & Applied Science Research;2024-08-02