Understanding disk failure rates

Author:

Schroeder, Bianca (1); Gibson, Garth A. (1)

Affiliation:

1. Carnegie Mellon University, Pittsburgh, PA

Abstract

Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million. This article is an extension of our previous study on disk failures [Schroeder and Gibson 2007] and presents and analyzes field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. More than 110,000 disks are covered by this data, some for an entire lifetime of five years. The data includes drives with SCSI and FC, as well as SATA interfaces. The mean time-to-failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. In other words, the replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors such as operating conditions affect replacement rates more than component-specific ones. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.
Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
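
The "at most 0.88%" figure in the abstract follows from dividing the hours in a year by the datasheet MTTF. The sketch below reproduces that arithmetic; it assumes drives are powered on continuously (8,760 hours per year), and the function name `nominal_afr` is a hypothetical helper introduced here for illustration, not something taken from the article.

```python
# Assumption: drives run 24 hours a day, 365 days a year.
HOURS_PER_YEAR = 24 * 365  # 8,760 powered-on hours


def nominal_afr(mttf_hours: float) -> float:
    """Approximate annual failure rate (as a fraction) implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


# The two datasheet MTTF bounds quoted in the abstract.
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:,} h -> nominal AFR {nominal_afr(mttf):.2%}")
```

For the two MTTF values quoted in the abstract this gives roughly 0.88% and 0.58%, so the commonly observed 2-4% field replacement rates are several times higher than the datasheet-implied rate.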

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

References: 38 articles.

1. An analysis of latent sector errors in disk drives

2. CFDR. 2007. The computer failure data repository. http://cfdr.usenix.org/.

3. Cole G. 2000. Estimating drive reliability in desktop computers and consumer electronics systems. TP-338.1. Seagate Technology, November.

4. Drummer D., Khurshudov A., Riedel E., and Watts R. 2006. Personal communication.

Cited by 63 articles.

1. A Hybrid Neural Ordinary Differential Equation Based Digital Twin Modeling and Online Diagnosis for an Industrial Cooling Fan;Future Internet;2023-09-04

2. Predicting Hard Disk Drive Faults, Failures and Associated Misbehavior’s;2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW);2023-05

3. Multidimensional Features Helping Predict Failures in Production SSD-Based Consumer Storage Systems;2023 Design, Automation & Test in Europe Conference & Exhibition (DATE);2023-04

4. Minimum Repair Bandwidth LDPC Codes for Distributed Storage Systems;IEEE Communications Letters;2023-02

5. HPC Forecast;Communications of the ACM;2023-01-20
