Investigating class rarity in big data-Reference-Cited by-同舟云学术

Investigating class rarity in big data

Published:2020-03-16 Issue:1 Volume:7 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Hasanin Tawfiq,Khoshgoftaar Taghi M.,Leevy Joffrey L.,Bauder Richard A.

Abstract

AbstractIn Machine Learning, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. With regard to datasets, class imbalance can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing a unique approach of three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high-performance scores for the learners, with all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, with regard to both case studies, the Gradient-Boosted Trees (GBT) learner performs the best.

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

http://link.springer.com/content/pdf/10.1186/s40537-020-00301-0.pdf

Reference39 articles.

1. Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools and good practices. In: 2013 Sixth International Conference on contemporary computing (IC3). NewYork: IEEE; 2013. p. 404–409.

2. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.

3. Soltysik RC, Yarnold PR. Megaoda large sample and big data time trials: separating the chaff. Optim Data Anal. 2013;2:194–7.

4. Cao M, Chychyla R, Stewart T. Big data analytics in financial statement audits. Account Horizons. 2015;29(2):423–9.

5. Bauder RA, Khoshgoftaar TM, Hasanin, T. An empirical study on class rarity in big data. In: 2018 17th IEEE International Conference on machine learning and applications (ICMLA). Newyork: IEEE ; 2018. p. 785–790. IEEE

Cited by 17 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. A Novel Supervised-Unsupervised Approach for Past-Due Prediction;Risk Management Magazine;2024-08

2. Data-Centric Solutions for Addressing Big Data Veracity with Class Imbalance, High Dimensionality, and Class Overlapping;Applied Sciences;2024-07-04

3. Improving Unbalanced Security X-Ray Image Classification Using VGG16 and AlexNet with Z-Score Normalization and Augmentation;Lecture Notes in Electrical Engineering;2024

4. Detecting malicious IoT traffic using Machine Learning techniques;Revista Română de Informatică și Automatică;2023-12-13

5. Investigating the effectiveness of one-class and binary classification for fraud detection;Journal of Big Data;2023-10-12