Smart Data Driven Decision Trees Ensemble Methodology for Imbalanced Big Data-Reference-Cited by-同舟云学术

Smart Data Driven Decision Trees Ensemble Methodology for Imbalanced Big Data

Published:2024-05-31 Issue:4 Volume:16 Page:1572-1588
ISSN:1866-9956
Container-title:Cognitive Computation
language:en
Short-container-title:Cogn Comput

Author:

García-Gil Diego^ORCID,García Salvador,Xiong Ning,Herrera Francisco

Abstract

AbstractDifferences in data size per class, also known as imbalanced data distribution, have become a common problem affecting data quality. Big Data scenarios pose a new challenge to traditional imbalanced classification algorithms, since they are not prepared to work with such amount of data. Split data strategies and lack of data in the minority class due to the use of MapReduce paradigm have posed new challenges for tackling the imbalance between classes in Big Data scenarios. Ensembles have been shown to be able to successfully address imbalanced data problems. Smart Data refers to data of enough quality to achieve high-performance models. The combination of ensembles and Smart Data, achieved through Big Data preprocessing, should be a great synergy. In this paper, we propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains, namely SD_DeTE methodology. This methodology is based on the learning of different decision trees using distributed quality data for the ensemble process. This quality data is achieved by fusing random discretization, principal components analysis, and clustering-based random oversampling for obtaining different Smart Data versions of the original data. Experiments carried out in 21 binary adapted datasets have shown that our methodology outperforms random forest.

Funder

Spanish National Research Project

Swedish Research Council

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s12559-024-10295-z.pdf

Reference53 articles.

1. Bansal M, Chana I, Clarke S. A survey on IoT Big data: Current status, 13 V’s challenges, and future directions. ACM Comput Surv. 53(6).

2. Kaiser MS, Zenia N, Tabassum F, Mamun SA, Rahman MA, Islam MS, Mahmud M. 6G access network for intelligent Internet of Healthcare Things: Opportunity, challenges, and research directions. In: Kaiser MS, Bandyopadhyay A, Mahmud M, Ray K, editors. Proceedings of International Conference on Trends in Computational and Cognitive Engineering. Singapore: Springer Singapore; 2021. pp. 317–28.

3. Ramírez-Gallego S, Fernández A, García S, Chen M, Herrera F. Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Inf Fusion. 2018;42:51–61.

4. Khan N, Naim A, Hussain MR, Naveed QN, Ahmad N, Qamar S. The 51 v’s of big data: Survey, technologies, characteristics, opportunities, issues and challenges. In: Proceedings of the International Conference on Omni-Layer Intelligent Systems, COINS ’19. New York: Association for Computing Machinery; 2019. pp. 19–24.

5. Ge M, Bangui H, Buhnova B. Big data for Internet of Things: a survey. Futur Gener Comput Syst. 2018;87:601–14.