Lifeline-based global load balancing

Published:2011-09-07 Issue:8 Volume:46 Page:201-212
ISSN:0362-1340
Container-title:ACM SIGPLAN Notices
language:en
Short-container-title:SIGPLAN Not.

Author:

Saraswat Vijay A.¹, Kambadur Prabhanjan², Kodali Sreedhar³, Grove David⁴, Krishnamoorthy Sriram⁵

Affiliation:

1. IBM TJ Watson Research Centre, Hawthorne, NY, USA

2. IBM TJ Watson Research Centre, Yorktown, USA

3. IBM Systems and Technology Group, Bangalore, India

4. IBM TJ Watson Research Centre, Hawthorne, USA

5. Pacific Northwest National Laboratory, Richland, USA

Abstract

On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. First, in the shared-memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (that is, to detect that programs at all nodes are looking for work, and hence that no work remains). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection. In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and it in turn shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced.
In a language such as X10, such passive distributed termination can be detected automatically using the finish construct; no application code is necessary. Our design is implemented in a few hundred lines of X10. On the binomial tree described in [olivier:08], the program achieves 87% efficiency on an InfiniBand cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/s. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.966 GNodes/s. All numbers are relative to single-core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes. In summary, we claim: (a) the first formulation of UTS that does not involve application-level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals, (c) the demonstration of simple lifeline graphs based on k-hypercubes, (d) performance with efficiency superior to (or equal to, but over a wider range than) published results on UTS. In particular, our framework can deliver performance equal to or better than that of an unrestricted random work-stealing implementation, while reducing the number of attempted steals.
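
As an illustration of the lifeline graphs described in the abstract, the following Python sketch builds one from a k-dimensional binary hypercube and checks its out-degree and diameter. It is not the authors' X10 implementation: the function names `hypercube_lifelines` and `diameter`, the restriction to exactly 2^k nodes, and the choice of bit-flip neighbours as outgoing lifelines are all assumptions made for this example.

```python
from collections import deque


def hypercube_lifelines(k):
    """Map each node to its outgoing lifelines in a k-dimensional binary hypercube.

    Assumption for this sketch: there are exactly 2**k nodes, and the lifelines
    of a node are the k nodes whose index differs from it in exactly one bit.
    """
    node_count = 1 << k
    return {
        node: [node ^ (1 << bit) for bit in range(k)]
        for node in range(node_count)
    }


def diameter(graph):
    """Return the longest shortest-path length over all ordered node pairs.

    Runs a breadth-first search from every node and raises ValueError if some
    node cannot reach every other node along lifeline edges.
    """
    worst = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for target in graph[current]:
                if target not in dist:
                    dist[target] = dist[current] + 1
                    queue.append(target)
        if len(dist) != len(graph):
            raise ValueError("some node is unreachable along lifelines")
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    k = 4
    graph = hypercube_lifelines(k)
    out_degree = max(len(targets) for targets in graph.values())
    print(len(graph), out_degree, diameter(graph))  # prints: 16 4 4
```

With k = 4 this gives 16 nodes, each with 4 outgoing lifelines, and a diameter of 4, so in this construction both the degree and the diameter grow only logarithmically in the number of nodes.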

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2038037.1941582

Reference: 31 articles.

1. ATLAS

2. The Natural Work-Stealing Algorithm is Stable

3. Starting with termination: a methodology for building distributed garbage collection algorithms;Blackburn S. M.;Aust. Comput. Sci. Commun.,2001

4. Scheduling multithreaded computations by work stealing

Cited by 49 articles.

1. Exploiting inherent elasticity of serverless in algorithms with unbalanced and irregular workloads;Journal of Parallel and Distributed Computing;2024-08

2. Task-Level Checkpointing for Nested Fork-Join Programs Using Work Stealing;Lecture Notes in Computer Science;2024

3. Malleable APGAS Programs and Their Support in Batch Job Schedulers;Lecture Notes in Computer Science;2024

4. Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous Machines;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11

5. RD-FCA: A resilient distributed framework for formal concept analysis;Journal of Parallel and Distributed Computing;2023-09