Clash of the titans: MapReduce vs. Spark for large scale data analytics

Published: 2015-09
Issue: 13
Volume: 8
Page: 2110-2121
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Shi Juwei¹, Qiu Yunjie², Minhas Umar Farooq³, Jiao Limei², Wang Chen⁴, Reinwald Berthold³, Özcan Fatma³

Affiliation:

1. Renmin University of China

2. IBM Research, China

3. IBM Almaden Research Center

4. Tsinghua University

Abstract

MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) We correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) We provide a break-down of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark. Furthermore, we attribute these performance differences to different components which are architected differently in the two frameworks. We further expose the source of these performance differences by using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce, for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. We show that MapReduce's execution model is more efficient for shuffling data than Spark, thus making Sort run faster on MapReduce.
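To make the "hash-based aggregation component for combine" concrete, the following is a minimal Python sketch of combining Word Count pairs with a hash table. It is contrasted with a sort-then-merge combine. That alternative is an assumption made here for comparison and is not stated in the abstract. The function names `hash_combine` and `sort_combine` are hypothetical, and neither function reproduces the actual code of Spark or MapReduce.

```python
from collections import defaultdict
from itertools import groupby


def hash_combine(pairs):
    """Combine (word, count) pairs in one pass using a hash table (no sorting)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def sort_combine(pairs):
    """Combine the same pairs by sorting on the key, then merging equal-key runs.

    Assumed alternative for illustration only; it pays an O(n log n) sort
    before any counts can be merged.
    """
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        word: sum(count for _, count in group)
        for word, group in groupby(ordered, key=lambda kv: kv[0])
    }


# Hypothetical map-side output for a tiny Word Count input.
map_output = [(word, 1) for word in "to be or not to be".split()]
```

Both functions yield the same per-word totals. The difference is only in the work done to reach them, which is the kind of per-component cost the paper's breakdown is meant to expose.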

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2831360.2831365

Cited by 138 articles.

1. Cellular automata-based MapReduce design: Migrating a big data processing model from Industry 4.0 to Industry 5.0; e-Prime - Advances in Electrical Engineering, Electronics and Energy; 2024-06

2. Data-centric workloads with MPI_Sort; Journal of Parallel and Distributed Computing; 2024-05

3. Evaluation of distributed data processing frameworks in hybrid clouds; Journal of Network and Computer Applications; 2024-04

4. A big data association rule mining based approach for energy building behaviour analysis in an IoT environment; Scientific Reports; 2023-11-13

5. Lifting the Fog of Uncertainties; Proceedings of the 2023 ACM Symposium on Cloud Computing; 2023-10-30