Affiliation:
1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
2. SKT Group Guangdong Province Key Laboratory of Popular High‐Performance Computers, Shenzhen, China
3. SKT Group Guangdong Province Engineering Center of China‐made High Performance Data Computing System, Shenzhen, China
Abstract
This article designs and implements DFCPP, a runtime library for general dataflow programming (Luo Q, Huang J, Li J, Du Z. Proceedings of the 52nd International Conference on Parallel Processing Workshops. ACM; 2023:145‐152.), and builds upon it to design and implement M‐DFCPP, a multi‐machine C++ dataflow library. Compared with existing dataflow programming environments, DFCPP offers a user‐friendly interface and richer expressive capabilities (Luo Q, Huang J, Li J, Du Z. Proceedings of the 52nd International Conference on Parallel Processing Workshops. ACM; 2023:145‐152.), enabling the representation of several types of dataflow actor tasks (static, dynamic, and conditional). In addition, DFCPP addresses memory management and task scheduling for non‐uniform memory access (NUMA) architectures, issues to which other dataflow libraries pay little attention. M‐DFCPP extends the capability of current dataflow runtime libraries (DFCPP, taskflow, openstream, etc.) to multi‐machine computing while keeping its API compatible with DFCPP. M‐DFCPP adopts the concepts of master and follower (Dean J, Ghemawat S. Commun ACM. 2008;51(1):107‐113; Ghemawat S, Gobioff H, Leung ST. ACM SIGOPS Operating Systems Review. ACM; 2003:29‐43.), which form a work‐sharing framework, as in many multi‐machine systems. To move to the M‐DFCPP framework, a filtering layer is inserted into the original DFCPP, transforming it into followers that can cooperate with each other. The master consists of modules for scheduling, data processing, graph partitioning, state management, and so forth. In benchmark tests on workloads whose directed acyclic graph topologies are binary trees and random graphs, DFCPP demonstrated performance improvements of 20% and 8%, respectively, over the second‐fastest library.
M‐DFCPP consistently exhibits strong performance across varying levels of concurrency and task workloads, achieving a maximum speedup of more than 20× over DFCPP when the task parallelism exceeds 5000 on 32 nodes. Moreover, as a runtime library supporting multi‐node dataflow computation, M‐DFCPP is compared with MPI, a runtime library supporting multi‐node control‐flow computation.
Funder
Science and Technology Foundation of Shenzhen City
National Key Research and Development Program of China