Affiliation:
1. University of Stuttgart, Institute for Parallel and Distributed Systems (IPVS), Stuttgart, Germany
2. Robert Bosch GmbH, Stuttgart, Germany
Abstract
Due to the growing complexity of modern data centers, failures are no longer uncommon. Therefore, fault tolerance mechanisms play a vital role in fulfilling availability requirements. Multiple availability models have been proposed to assess compute systems, among which Bayesian network models have gained popularity in industry and research due to their powerful modeling formalism. In particular, this work focuses on assessing the availability of redundant and replicated cloud computing services with Bayesian networks. So far, research on availability has focused on modeling either infrastructure or communication failures in Bayesian networks, but has not considered both simultaneously. This work addresses practical modeling challenges of assessing the availability of large‐scale redundant and replicated services with Bayesian networks, including cascading and common‐cause failures from the surrounding infrastructure and communication network. To ease the modeling task, this paper introduces a high‐level modeling formalism to build such a Bayesian network automatically. Performance evaluations demonstrate the feasibility of the presented Bayesian network approach for assessing the availability of large‐scale redundant and replicated services. The model is applicable not only in the domain of cloud computing but also to general cases of local and geo‐distributed systems.
Subject
Management Science and Operations Research; Safety, Risk, Reliability and Quality