Distributed big data analysis using spark parallel data processing

Published:2022-06-01 Issue:3 Volume:11 Page:1505-1515
ISSN:2302-9285
Container-title:Bulletin of Electrical Engineering and Informatics
Short-container-title:Bulletin EEI

Author:

Omar Hoger Khayrolla, Jumaa Alaa Khalil

Abstract

The big data market is growing rapidly. The main challenge is finding a system that can store and handle huge volumes of data and then process that data to mine hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It combines a fast big data processing engine, Apache Spark, with a big data storage environment, Apache Hadoop. The system was tested on about 11 gigabytes of text data collected from multiple sources for sentiment analysis. Three machine learning (ML) algorithms, all supported by the Spark ML package, are used in the system. The programs were written in Java and Scala, and the constructed model combines the classification algorithms with the pre-processing steps in the form of an ML pipeline. The proposed system was implemented in both centralized and distributed data processing modes. Moreover, several dataset manipulation approaches were applied in the system tests to check which one provides the best accuracy and time performance. The results showed that the system handles big data efficiently: it achieves excellent accuracy with fast execution time, especially on the distributed data nodes.
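The pipeline idea in the abstract (pre-processing stages chained in front of a classifier) can be illustrated with a small, dependency-free Java sketch. This is not the authors' code and does not use Spark ML. The abstract does not name the three algorithms or the pre-processing steps, so the tokenizer, the stop-word list, and the naive Bayes classifier below are assumptions chosen for illustration, and the four training sentences are invented toy data.

```java
import java.util.*;

public class SentimentPipelineSketch {

    // Hypothetical stop-word list; a real system would use a much larger one.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "is", "this", "it", "and"));

    // Stage 1: lowercase the text and split it on non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stage 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) kept.add(t);
        }
        return kept;
    }

    // The pre-processing stages chained in order, as a pipeline would do.
    static List<String> preprocess(String text) {
        return removeStopWords(tokenize(text));
    }

    // Stage 3: multinomial naive Bayes with add-one (Laplace) smoothing.
    static class NaiveBayes {
        final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
        final Map<String, Integer> totalWords = new HashMap<>();
        final Map<String, Integer> docCounts = new HashMap<>();
        final Set<String> vocab = new HashSet<>();
        int docs = 0;

        void fit(List<String> tokens, String label) {
            docs++;
            docCounts.merge(label, 1, Integer::sum);
            Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
            for (String t : tokens) {
                counts.merge(t, 1, Integer::sum);
                totalWords.merge(label, 1, Integer::sum);
                vocab.add(t);
            }
        }

        // Returns the label with the highest log-posterior score.
        String predict(List<String> tokens) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : docCounts.keySet()) {
                double score = Math.log(docCounts.get(label) / (double) docs);
                int total = totalWords.getOrDefault(label, 0);
                for (String t : tokens) {
                    int c = wordCounts.get(label).getOrDefault(t, 0);
                    score += Math.log((c + 1.0) / (total + vocab.size()));
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = label;
                }
            }
            return best;
        }
    }

    // Fits the classifier on {text, label} pairs after pre-processing each text.
    static NaiveBayes train(String[][] labeled) {
        NaiveBayes model = new NaiveBayes();
        for (String[] row : labeled) {
            model.fit(preprocess(row[0]), row[1]);
        }
        return model;
    }

    public static void main(String[] args) {
        String[][] training = {
            {"I love this phone, it is great", "positive"},
            {"Great battery and a lovely screen", "positive"},
            {"I hate this phone, it is terrible", "negative"},
            {"Terrible battery and an awful screen", "negative"}
        };
        NaiveBayes model = train(training);
        System.out.println(model.predict(preprocess("great screen")));
        System.out.println(model.predict(preprocess("awful phone, I hate it")));
    }
}
```

In Spark ML the same shape would typically be expressed by placing feature transformers and a classifier as stages of a `Pipeline`, with the data distributed across worker nodes rather than held in local collections as here.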

Publisher

Institute of Advanced Engineering and Science

Subject

Electrical and Electronic Engineering, Control and Optimization, Computer Networks and Communications, Hardware and Architecture, Instrumentation, Information Systems, Control and Systems Engineering, Computer Science (miscellaneous)

Cited by 5 articles.

1. Spark Analysis Technology in Engineering Project Management Support in Big Data Scenarios; 2024 International Conference on Machine Intelligence and Digital Applications; 2024-05-30

2. DPro-SM – A distributed framework for proactive straggler mitigation using LSTM; Heliyon; 2024-01

3. Computationally Efficient Neural Rendering for Generator Adversarial Networks Using a Multi-GPU Cluster in a Cloud Environment; IEEE Access; 2023

4. Frameworks, Applications and Challenges in Streaming Big Data Analytics: A Review; 2022 3rd International Conference on Innovations in Computer Science & Software Engineering (ICONICS); 2022-12-14

5. Distributed random vector functional link network with subspace-based local connections; Journal of Shenzhen University Science and Engineering; 2022-11-01