A Spark-Based Artificial Bee Colony Algorithm for Unbalanced Large Data Classification

Published:2022-11-08 Issue:11 Volume:13 Page:530
ISSN:2078-2489
Container-title:Information
language:en
Short-container-title:Information

Author:

Al-Sawwa Jamil, Almseidin Mohammad

Abstract

With the rapid development of internet technology, the amount of collected and generated data has increased exponentially. The sheer volume, complexity, and unbalanced nature of this data make it challenging to extract meaningful information within a reasonable time. In this paper, we implement a scalable design of the artificial bee colony algorithm for big data classification using Apache Spark, and we propose a new fitness function to handle unbalanced data. Two experiments were performed on real unbalanced datasets to assess the performance and scalability of the proposed algorithm. The performance results show that the proposed fitness function deals efficiently with unbalanced datasets and statistically outperforms the existing fitness function in terms of G-mean and F1-score. The scalability results demonstrate that the Spark-based design achieves speedup and scaleup results that are very close to optimal and scales efficiently with increasing data size.
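The abstract evaluates classifiers by G-mean and F1-score rather than plain accuracy. The following is a minimal sketch of the standard textbook definitions of those two metrics for a binary problem, computed from confusion-matrix counts. It is not the fitness function proposed in the paper, which the abstract does not specify; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples classified correctly."""
    total = tp + fn + tn + fp
    return (tp + tn) / total if total else 0.0


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f1_score(tp, fn, fp):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical unbalanced test set: 10 positives, 990 negatives.
# A classifier that labels every sample negative gives these counts.
tp, fn, tn, fp = 0, 10, 990, 0
print(accuracy(tp, fn, tn, fp))  # 0.99
print(g_mean(tp, fn, tn, fp))    # 0.0
print(f1_score(tp, fn, fp))      # 0.0
```

In this hypothetical case the classifier never detects the minority class, yet its accuracy is 0.99, while G-mean and F1-score both fall to 0. This is the usual reason metrics of this kind are preferred on unbalanced data.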

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/13/11/530/pdf

References: 28 articles.

1. Sayad, S. (2011). Real Time Data Mining, Self-Help Publishers.

2. Learning from imbalanced data: Open challenges and future directions; Prog. Artif. Intell., 2016

3. (2021, December 24). Spark 2.1.0 Documentation. Available online: https://spark.apache.org/docs/2.1.0/.

4. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., and Stoica, I. (2012, January 25–27). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), San Jose, CA, USA.

5. (2021, December 04). Apache Hadoop- MapReduce. Available online: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.

Cited by 2 articles.

1. Comprehensive Review of Metaheuristic Algorithms (MAs) for Optimal Control (OCl) Improvement; Archives of Computational Methods in Engineering; 2024-01-31

2. A Hybrid Optimization Driven Deep Residual Network for Sybil Attack Detection and Avoidance in Wireless Sensor Networks; Communications in Computer and Information Science; 2024