A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Published:2020-12 Issue:1 Volume:7
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Ahmed N., Barczak Andre L. C., Susnjak Teo, Rashid Mohammed A.

Abstract

Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help system administrators deploy their applications without much effort and measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? This study therefore investigates the most impactful parameters, grouped under resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters by trial and error across a large number of experiments. For the comparative analysis, we selected two workloads: WordCount and TeraSort. Performance is evaluated on three criteria: execution time, throughput, and speedup. Our experimental results reveal that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis shows that Spark outperforms Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management, Computer Networks and Communications, Hardware and Architecture, Information Systems

Link

http://link.springer.com/content/pdf/10.1186/s40537-020-00388-5.pdf

References: 38 articles.

1. Apache Hadoop Documentation 2014. http://hadoop.apache.org/. Accessed 15 July 2020.

2. Verma A, Mansuri AH, Jain N. Big data management processing with hadoop mapreduce and spark technology: A comparison. In: 2016 symposium on colossal data analysis and networking (CDAN). New York: IEEE; 2016. p. 1–4.

3. Management Association IR. Big Data: concepts, methodologies, tools, and applications. Hershey: IGI Global; 2016.

4. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, Mccauley M, Franklin M, Shenker S, Stoica I. Fast and interactive analytics over hadoop data with spark. Usenix Login. 2012;37:45–51.

5. Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.

Cited by 54 articles.

1. Artificial Intelligence Applications in Smart Healthcare: A Survey;Future Internet;2024-08-27

2. A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning;Journal of Parallel and Distributed Computing;2024-07

3. Identification of Influential Nodes in Social Network: Big Data - Hadoop;International Journal of Data Science;2024-06-30

4. Comparative Analysis of Hadoop and Spark Performance for Real-time Big Data Smart Platforms Utilizing IoT Technology in Electrical Facilities;Journal of Electrical Engineering & Technology;2024-06-01

5. Cellular automata-based MapReduce design: Migrating a big data processing model from Industry 4.0 to Industry 5.0;e-Prime - Advances in Electrical Engineering, Electronics and Energy;2024-06