Fuxi

Published:2014-08 Issue:13 Volume:7 Page:1393-1404
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Zhang Zhuo¹,Li Chao¹,Tao Yangyu¹,Yang Renyu²,Tang Hong¹,Xu Jie³

Affiliation:

1. Alibaba Cloud Computing Inc.

2. Beihang University and Alibaba Cloud Computing Inc.

3. University of Leeds

Abstract

Scalability and fault tolerance are two fundamental challenges for all distributed computing at Internet scale. Despite many recent advances from both academia and industry, these two problems are still far from settled. In this paper, we present Fuxi, a resource management and job scheduling system that is capable of handling the kind of workload at Alibaba, where hundreds of terabytes of data are generated and analyzed every day to help optimize the company's business operations and user experiences. We employ several novel techniques to enable Fuxi to efficiently schedule hundreds of thousands of concurrent tasks over large clusters with thousands of nodes: 1) an incremental resource management protocol that supports multi-dimensional resource allocation and data locality; 2) user-transparent failure recovery, where failures of any Fuxi component will not impact the execution of user jobs; and 3) an effective mechanism for detecting faulty nodes and a multi-level blacklisting scheme that prevents them from affecting job execution. Our evaluation results demonstrate that scheduled CPU and memory utilization of 95% and 91%, respectively, can be achieved under synthetic workloads, and that Fuxi is capable of achieving 2.36 TB/minute throughput in GraySort. Additionally, the same Fuxi job experiences only approximately 16% slowdown under a 5% fault-injection rate, and the slowdown grows only to 20% when we double the fault-injection rate to 10%. Fuxi has been deployed in our production environment since 2009, and it now manages hundreds of thousands of server nodes.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2733004.2733012

Cited by 114 articles.

1. Scheduling IDC-based virtual power plants considering backup power;Electric Power Systems Research;2024-09

2. Zero+: Monitoring Large-Scale Cloud-Native Infrastructure Using One-Sided RDMA;IEEE/ACM Transactions on Networking;2024-08

3. LGDCloudSim: A Resource Management Simulation System for Large-Scale Geographically Distributed Cloud Data Center Scenarios;2024 IEEE 17th International Conference on Cloud Computing (CLOUD);2024-07-07

4. A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning;Proceedings of the VLDB Endowment;2024-07

5. PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters;Parallel Computing;2024-06