Data-Centric Solutions for Addressing Big Data Veracity with Class Imbalance, High Dimensionality, and Class Overlapping

Published: 2024-07-04 Volume: 14 Issue: 13 Page: 5845
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en

Author:

Bolívar Armando¹, García Vicente², Alejo Roberto³, Florencia-Juárez Rogelio², Sánchez J. Salvador⁴

Affiliation:

1. Instituto de Ingeniería y Tecnología, Universidad Autónoma de Ciudad Juárez, Av. del Charro 450 NTE., Ciudad Juárez 32310, Chihuahua, Mexico

2. División Multidisciplinaria en Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Delgado 18100, Ciudad Juárez 32579, Chihuahua, Mexico

3. Division of Postgraduate Studies and Research, Tecnológico Nacional de México, Instituto Tecnológico de Toluca, Av. Tecnológico s/n, Colonia Agrícola Bellavista, Metepec 52149, Estado de México, Mexico

4. Institute of New Imaging Technologies, Department of Computer Languages and Systems, Universitat Jaume I, Av. de Vicent Sos Baynat s/n, 12071 Castelló de la Plana, Spain

Abstract

Machine learning algorithms offer organizations an innovative way to obtain value from their large datasets, guiding future strategic actions and improving their initiatives. This has led to the rapid and growing application of various machine learning algorithms, with a predominant focus on building models and improving their performance. However, this model-centric approach ignores the fact that data quality is crucial for building robust and accurate models. Several dataset issues, such as class imbalance, high dimensionality, and class overlapping, degrade data quality and introduce bias into machine learning models. Therefore, adopting a data-centric approach is essential to constructing better datasets and producing effective models. Beyond these data issues, Big Data imposes new challenges, such as the scalability of algorithms. This paper proposes a scalable hybrid approach that jointly addresses class imbalance, high dimensionality, and class overlapping in Big Data domains. The proposal is based on well-known data-level solutions whose main operation is finding nearest neighbors, using the Euclidean distance to measure similarity. However, these strategies may lose their effectiveness on high-dimensional datasets. Hence, data quality is improved by combining a data transformation approach based on fractional norms with SMOTE to obtain a balanced and reduced dataset. Experiments on nine large, two-class, imbalanced, high-dimensional datasets showed that our scalable methodology, implemented in Spark, outperforms the traditional approach.
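To make the two ingredients named in the abstract concrete, the following is a minimal single-machine sketch in NumPy, not the authors' Spark implementation. It assumes that "fractional norm" means a Minkowski-style dissimilarity with exponent 0 < p < 1, and that SMOTE's nearest-neighbor search simply uses that dissimilarity in place of the Euclidean distance. The function names `fractional_distance` and `smote_fractional`, the defaults `p=0.5` and `k=5`, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fractional_distance(a, b, p=0.5):
    """Minkowski-style dissimilarity; 0 < p < 1 gives a fractional 'norm'.

    Assumption: this is one common definition, not necessarily the
    exact transformation used in the paper.
    """
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)


def smote_fractional(X_min, n_new, k=5, p=0.5, seed=0):
    """SMOTE-style oversampling of the minority class X_min (n >= 2 rows).

    Neighbors are ranked with fractional_distance instead of the
    Euclidean distance; new points are interpolated between a
    minority sample and one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise dissimilarities, shape (n, n); O(n^2) memory, so this
    # only suits small illustrative inputs.
    d = fractional_distance(X_min[:, None, :], X_min[None, :, :], p)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy usage: 20 minority samples in 50 dimensions, 30 synthetic ones.
X_minority = np.random.default_rng(1).normal(size=(20, 50))
X_synthetic = smote_fractional(X_minority, n_new=30)
print(X_synthetic.shape)
```

Each synthetic sample is a convex combination of two minority samples, so it stays inside the per-feature range of the minority class. A scalable version would need a distributed neighbor search, which this sketch does not attempt.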

Funder

Google Cloud credits from the Google for Education program

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/13/5845/pdf
