Affiliation:
1. Microsoft Corporation
2. University of Virginia
Abstract
With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. When failures increase, a large datacenter facility incurs higher maintenance costs in addition to service unavailability. Among different server components, hard disk drives are known to contribute significantly to server failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large-scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. We specifically establish correlations between temperatures and failures observed at different location granularities: (a) inside drive locations in a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature exhibits a stronger correlation with failures than disk utilization does. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. We corroborate our findings from the real data study and show that workload knobs have minimal impact on temperature, whereas chassis knobs such as disk placement and fan speeds have a larger impact. Finally, we show the proposed cost benefit of temperature optimizations that increase hard disk drive reliability.
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture
Reference 31 articles.
1. Cole G. 2000. Estimating drive reliability in desktop computers and consumer electronics systems. Seagate Tech. rep. TP-338.1.
2. Temperature management in data centers
3. Facebook 2011. Open compute project at Facebook. http://opencompute.org/.
Cited by
39 articles.
1. New Weibull Log-Logistic grey forecasting model for a hard disk drive failures;Applied Mathematical Modelling;2024-07
2. Building a Rule-Based Expert System to Enhance the Hard Disk Drive Manufacturing Processes;IEEE Access;2024
3. Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365;Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering;2023-11-30
4. Disk Failure Trends in Alpine Storage System;Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis;2023-11-12
5. Comparative eco-efficiency assessment of cybersecurity solutions;Environmental Impact Assessment Review;2023-05