Improving cluster availability using workstation validation

Published:2002-06 Issue:1 Volume:30 Page:217-227
ISSN:0163-5999
Container-title:ACM SIGMETRICS Performance Evaluation Review
language:en
Short-container-title:SIGMETRICS Perform. Eval. Rev.

Author:

Heath Taliver¹, Martin Richard P.¹, Nguyen Thu D.¹

Affiliation:

1. Rutgers University, Piscataway, NJ

Abstract

We demonstrate a framework for improving the availability of cluster-based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well-defined interfaces and failure semantics. Such a decomposition allows designers to engineer high availability based on an understanding of the interconnections and isolated fault behavior of each component, as opposed to ad hoc methods. In this work, we focus on using the entire commodity workstation as a component because it possesses natural, fault-isolated interfaces. We define a failure event as a reboot, both because a workstation is unavailable during a reboot and because reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of three distinct clusters show that the time between reboots is best modeled by a Weibull distribution with shape parameters less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this observed property, we design an allocation strategy that withholds recently rebooted workstations from active service, validating their stability before allowing them to return to service. We show via simulation that this policy leads to a 70-30 rule of thumb: for a constant utilization, approximately 70% of the workstation failures can be masked from end clients with 30% extra capacity added to the cluster, provided reboots are not strongly correlated. We also find that our technique is more sensitive to the burstiness of reboots than to the absolute lengths of workstation uptimes.
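
The link the abstract draws between a Weibull shape parameter below 1 and workstations becoming more reliable with uptime can be illustrated numerically. The following is a minimal sketch, not the authors' code: the function names and the shape and scale values (0.5 and 100 hours) are assumptions chosen for illustration and are not taken from the paper. It uses the standard Weibull hazard h(t) = (k/λ)(t/λ)^(k-1) and survival function S(t) = exp(-(t/λ)^k).

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k/lambda) * (t/lambda)^(k-1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability that an uptime exceeds t: S(t) = exp(-(t/lambda)^k)."""
    return math.exp(-((t / scale) ** shape))


def survives_next(age, window, shape, scale):
    """P(no reboot in the next `window` hours | already up for `age` hours)."""
    return (weibull_survival(age + window, shape, scale)
            / weibull_survival(age, shape, scale))


# Hypothetical parameters for illustration only (not values from the paper).
SHAPE, SCALE = 0.5, 100.0

if __name__ == "__main__":
    for age in (1.0, 24.0, 168.0):
        print(
            f"age={age:6.1f} h  "
            f"hazard={weibull_hazard(age, SHAPE, SCALE):.5f}  "
            f"P(survive next 24 h)={survives_next(age, 24.0, SHAPE, SCALE):.3f}"
        )
```

With a shape below 1 the hazard falls as age grows, so a workstation that has already stayed up through a validation period is less likely to reboot in the next interval than one that has just restarted. With a shape of exactly 1 the distribution is exponential and age carries no information, so withholding recently rebooted machines would bring no benefit.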

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture, Software

Link

https://dl.acm.org/doi/pdf/10.1145/511399.511362

References: 23 articles.

1. Lessons from giant-scale services

Cited by 26 articles.

1. Proactive Fault Prediction of Fog Devices Using LSTM-CRP Conceptual Framework for IoT Applications;Sensors;2023-03-08

2. Predicting machine behavior from Google cluster workload traces;Concurrency and Computation: Practice and Experience;2022-12-10

3. INEC: Fast and Coherent In-Network Erasure Coding;SC20: International Conference for High Performance Computing, Networking, Storage and Analysis;2020-11

4. Evaluation of Self-Healing Systems: An Analysis of the State-of-the-Art and Required Improvements;Computers;2020-02-27

5. Load Balance for Distributed Real-time Computing Systems;EAST CHINA NORM UNIV;2020