Bridging the gap between HPC and big data frameworks

Published:2017-04 Issue:8 Volume:10 Page:901-912
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Anderson Michael¹, Smith Shaden², Sundaram Narayanan¹, Capotă Mihai¹, Zhao Zheguang³, Dulloor Subramanya⁴, Satish Nadathur¹, Willke Theodore L.¹

Affiliation:

1. Parallel Computing Lab

2. University of Minnesota

3. Brown University

4. Intel Corporation

Abstract

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3090163.3090168

Cited by 37 articles.

1. CLIC: An Extensible and Efficient Cross-Platform Data Analytics System;IEEE Transactions on Parallel and Distributed Systems;2024-01

2. Communication-Avoiding Recursive Aggregation;2023 IEEE International Conference on Cluster Computing (CLUSTER);2023-10-31

3. High-Performance Computation in Big Data Analytics;Intelligent Systems Design and Applications;2023

4. DSParLib: A C++ Template Library for Distributed Stream Parallelism;International Journal of Parallel Programming;2022-10-29

5. A unified framework to improve the interoperability between HPC and Big Data languages and programming models;Future Generation Computer Systems;2022-09