Abstract
We present Lux, a distributed multi-GPU system that achieves fast graph processing by exploiting the aggregate memory bandwidth of multiple GPUs and taking advantage of locality in the memory hierarchy of multi-GPU clusters. Lux provides two execution models that optimize algorithmic efficiency and enable important GPU optimizations, respectively. Lux also uses a novel dynamic load balancing strategy that is cheap and achieves good load balance across GPUs. In addition, we present a performance model that quantitatively predicts the execution times and automatically selects the runtime configurations for Lux applications. Experiments show that Lux achieves up to 20X speedup over state-of-the-art shared memory systems and up to two orders of magnitude speedup over distributed systems.
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
52 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems;ACM Transactions on Architecture and Code Optimization;2024-09-14
2. Improving Graph Compression for Efficient Resource-Constrained Graph Analytics;Proceedings of the VLDB Endowment;2024-05
3. Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2;2024-04-27
4. A Comprehensive Survey on Distributed Training of Graph Neural Networks;Proceedings of the IEEE;2023-12
5. Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous Machines;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11