Supercharging distributed computing environments for high-performance data engineering

Published:2024-07-12 Volume:2
ISSN:2813-7337
Container-title:Frontiers in High Performance Computing
Short-container-title:Front. High Perform. Comput.

Author:

Perera Niranda, Sarker Arup Kumar, Shan Kaiying, Fetea Alex, Kamburugamuve Supun, Kanewala Thejaka Amila, Widanage Chathura, Staylor Mills, Zhong Tianle, Abeykoon Vibhatha, von Laszewski Gregor, Fox Geoffrey

Abstract

The data engineering and data science community has embraced the idea of using Python and R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these frameworks are now ever more important for processing terabytes of data. Such workloads can easily exceed the capabilities of a single machine, and they also demand significant developer time and effort, even though dataframes are valued for their convenience and for high-level data-manipulation abstractions that can be optimized. It is therefore essential to design scalable dataframe solutions. There have been multiple efforts to tackle this problem as efficiently as possible, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though the distributed computing features of Dask and Ray look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (supercharging them!). To achieve this, we integrate Cylon, a high-performance dataframe system originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30× better distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance by leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate the high-performance computing and distributed computing ecosystems.

Funder

National Science Foundation

Publisher

Frontiers Media SA
