SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

Published:2018-12-12 Issue:03 Volume:18 Page:e23
ISSN:1666-6038
Container-title:Journal of Computer Science and Technology
Short-container-title:JC&ST

Author:

Basgall María José, Hasperué Waldo, Naiouf Marcelo, Fernández Alberto, Herrera Francisco

Abstract

The volume of data in today's applications has changed the way Machine Learning problems are addressed. Indeed, the Big Data scenario imposes scalability constraints that can only be met through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples across classes. In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is designed to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
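
As an illustration of the neighborhood-based interpolation that the abstract attributes to SMOTE, the following is a minimal single-machine sketch in plain Python. It is not the SMOTE-BD implementation, which targets Spark and is not described in detail here. The function name `smote`, its parameters (`n_synthetic`, `k`, `seed`), and the toy data are assumptions made for this example only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Sketch of classic SMOTE: interpolate between minority-class neighbors.

    `minority` is a list of equal-length numeric tuples (hypothetical input format).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority example as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors (Euclidean distance), excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Place the new point at a random position on the segment base -> neighbor.
        neighbor = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with four 2-D examples (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because each synthetic point lies on a segment between two existing minority examples, it stays inside the region the minority class already occupies. The brute-force neighbor search in this sketch is quadratic in the number of minority examples, which is the kind of cost a distributed design has to address.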

Publisher

Universidad Nacional de La Plata

Subject

Artificial Intelligence, Computer Science Applications, Computer Vision and Pattern Recognition, Hardware and Architecture, Computer Science (miscellaneous), Software

Cited by 14 articles.

1. A hybrid artificial intelligence algorithm for fault diagnosis of hot rolled strip crown imbalance;Engineering Applications of Artificial Intelligence;2024-04

2. Adversarial Approaches to Tackle Imbalanced Data in Machine Learning;Sustainability;2023-04-24

3. Analysis and design of scalable pre-processing techniques of instances for imbalanced Big Data problems. Applications in humanitarian emergencies situations;Journal of Computer Science and Technology;2022-10-17

4. Review on the Application of Big Data Algorithms to Understand a Pandemic Virus;Handbook of Research on Applied Artificial Intelligence and Robotics for Government Processes;2022-09-16

5. Self-boosted with dynamic semi-supervised clustering method for imbalanced big data classification;Multimedia Tools and Applications;2022-05-20