Abstract
Switch failures can hamper access to client services, cause link congestion, and blackhole network traffic. In this study, we examine the nature of switch failures in the datacenters of a large commercial cloud provider through the lens of survival theory. We study a cohort of over 180,000 switches with a variety of hardware and software configurations and find that datacenter switches have a 98% likelihood of functioning uninterrupted for over 3 months after deployment in production. However, there is significant heterogeneity in switch survival rates with respect to their hardware and software: the switches of one vendor are twice as likely to fail as those of the other vendors. We attribute the majority of switch failures to hardware impairments and unplanned power losses. We find that the in-house switch operating system, SONiC, boosts the survival likelihood of switches in datacenters by 1% by eliminating switch failures caused by software bugs in vendor switch OSes.
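To make the notion of "survival likelihood" concrete, the sketch below computes a Kaplan-Meier survival curve, a standard estimator from survival analysis. The abstract does not state which estimator the authors use, so this is an illustrative assumption rather than their method. The function name `kaplan_meier` and the ten-switch cohort are hypothetical and are not data from the study.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: days each switch was observed after deployment
    observed:  True if the switch failed at that time,
               False if it was still healthy when observation ended (censored)
    """
    failures = Counter(t for t, failed in zip(durations, observed) if failed)
    exits = Counter(durations)      # switches leaving the risk set at time t
    at_risk = len(durations)        # switches still under observation
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk switches that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy cohort (NOT data from the study): 10 switches observed for
# up to 90 days; two fail, one is censored early, seven are censored at day 90.
durations = [30, 45, 60, 90, 90, 90, 90, 90, 90, 90]
observed = [True, False, True, False, False, False, False, False, False, False]

curve = kaplan_meier(durations, observed)
for day, s in curve:
    print(f"day {day}: estimated survival {s:.4f}")
```

Censored switches, those that were still working when observation stopped, remain in the risk set until they leave it, which is why the estimator does not treat them as failures. A figure such as "98% likelihood of functioning uninterrupted for over 3 months" corresponds to reading a curve like this at roughly day 90.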
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications,Software
Cited by
15 articles.
1. Canary: Congestion-aware in-network allreduce using dynamic trees;Future Generation Computer Systems;2024-03
2. Photonic switched networking for data centers and advanced computing systems;Optical Fiber Communication Conference (OFC) 2024;2024
3. P4toNFV: Offloading from P4 switches to NFV in programmable data planes;International Journal of Communication Systems;2023-12-21
4. Physical Deployability Matters;Proceedings of the 22nd ACM Workshop on Hot Topics in Networks;2023-11-28
5. Availability modeling and evaluation of switches and data centers;2023 10th International Conference on Dependable Systems and Their Applications (DSA);2023-08-10