RAIDShield

Author:

Ma Ao (1), Traylor Rachel (2), Douglis Fred (1), Chamness Mark (1), Lu Guanlin (1), Sawyer Darren (1), Chandra Surendar (3), Hsu Windsor (4)

Affiliation:

1. EMC Corporation, Santa Clara, CA

2. EMC Corporation and University of Texas at Arlington, Santa Clara, CA

3. Datrium, Inc., Santa Clara, CA

4. Datrium, Inc.

Abstract

Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from six disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures. With these findings we designed RAIDShield, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures. Second, we have designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures, which can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors. We conclude with discussions of operational considerations in deploying RAIDShield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.
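The idea of a joint failure probability for a RAID group can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that disks fail independently and that a per-disk failure probability is already available (for example, derived from each disk's reallocated sector count). The function names and the probability values below are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def failure_count_distribution(probs: Sequence[float]) -> List[float]:
    """Return dist where dist[k] = P(exactly k disks fail).

    Assumption (not from the abstract): disks fail independently,
    so the count of failures follows a Poisson binomial distribution.
    """
    dist = [1.0]  # with zero disks considered, P(0 failures) = 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this disk survives
            nxt[k + 1] += mass * p       # this disk fails
        dist = nxt
    return dist


def prob_at_least(probs: Sequence[float], k: int) -> float:
    """P(at least k disks in the group fail), under the same assumption."""
    return sum(failure_count_distribution(probs)[k:])


# Hypothetical per-disk failure probabilities for two 14-disk groups.
healthy_group = [0.01] * 14
degraded_group = [0.01] * 10 + [0.15] * 4

# RAID-6 tolerates two concurrent failures, so three or more is the risk case.
risk_healthy = prob_at_least(healthy_group, 3)
risk_degraded = prob_at_least(degraded_group, 3)
print(f"healthy group:  P(>=3 failures) = {risk_healthy:.6f}")
print(f"degraded group: P(>=3 failures) = {risk_degraded:.6f}")
```

In this sketch, the four moderately degraded disks raise the group-level risk even if no single disk would cross a hypothetical per-disk replacement threshold, which matches the abstract's point that a group can be at risk when no individual disk is flagged in isolation.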

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture


Cited by 55 articles.

1. Locally Repairable Convertible Codes With Optimal Access Costs;IEEE Transactions on Information Theory;2024-09

2. Examining the impact of critical attributes on hard drive failure times: Multi‐state models for left‐truncated and right‐censored semi‐competing risks data;Applied Stochastic Models in Business and Industry;2023-12-03

3. Self-optimised cost-sensitive classifiers for early field failure prediction in storage systems;Swarm and Evolutionary Computation;2023-12

4. From Missteps to Milestones: A Journey to Practical Fail-Slow Detection;ACM Transactions on Storage;2023-11

5. Tunable Sparing of Disks in a Cloud Data Center;2023 7th International Conference on Computer Applications in Electrical Engineering-Recent Advances (CERA);2023-10-27
