Supercharging distributed computing environments for high-performance data engineering
Published:2024-07-12
Volume:2
ISSN:2813-7337
Container-title:Frontiers in High Performance Computing
Short-container-title:Front. High Perform. Comput.
Author:
Perera Niranda, Sarker Arup Kumar, Shan Kaiying, Fetea Alex, Kamburugamuve Supun, Kanewala Thejaka Amila, Widanage Chathura, Staylor Mills, Zhong Tianle, Abeykoon Vibhatha, von Laszewski Gregor, Fox Geoffrey
Abstract
The data engineering and data science community has embraced Python and R dataframes for everyday applications. Driven by the big data revolution and artificial intelligence, these frameworks are now increasingly important for processing terabytes of data. Such workloads can easily exceed the capabilities of a single machine, yet dataframes also command significant developer time and effort because of their convenience and their ability to manipulate data through high-level abstractions that can be optimized. It is therefore essential to design scalable dataframe solutions. Multiple efforts have tried to tackle this problem as efficiently as possible, the most notable being the dataframe systems built on distributed computing environments such as Dask and Ray. Even though the distributed computing features of Dask and Ray look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (supercharging them!). To achieve this, we integrate Cylon, a high-performance dataframe system originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that, on a pipeline of dataframe operators, CylonFlow achieves 30× better distributed performance than Dask Dataframes. Interestingly, it also delivers superior sequential performance by leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate the high-performance computing and distributed computing ecosystems.
Funder
National Science Foundation
Publisher
Frontiers Media SA