Affiliation:
1. University of Toronto, Toronto, Canada
2. Microsoft Research, Redmond, WA, USA
Abstract
We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Software
Cited by
410 articles.