RAIDShield

Published:2015-11-21 Issue:4 Volume:11 Page:1-28
ISSN:1553-3077
Container-title:ACM Transactions on Storage
language:en
Short-container-title:ACM Trans. Storage

Author:

Ma Ao¹,Traylor Rachel²,Douglis Fred¹,Chamness Mark¹,Lu Guanlin¹,Sawyer Darren¹,Chandra Surendar³,Hsu Windsor⁴

Affiliation:

1. EMC Corporation, Santa Clara, CA

2. EMC Corporation and University of Texas at Arlington, Santa Clara, CA

3. Datrium, Inc., Santa Clara, CA

4. Datrium, Inc.

Abstract

Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from six disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures. With these findings we designed RAIDShield, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures. Second, we have designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures, which can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors. We conclude with discussions of operational considerations in deploying RAIDShield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.
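
The joint-failure idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes each disk has an independent, already-estimated failure probability (the abstract does not say how these are derived or whether independence holds), and the function name and input values are hypothetical. It computes the probability that at least three disks in a group fail, which is the number of failures that exceeds RAID-6 protection.

```python
from typing import List


def prob_at_least_k_failures(p_fail: List[float], k: int) -> float:
    """Probability that at least k disks in a group fail.

    Assumption (not from the paper): disks fail independently, each with
    its own probability p_fail[i]. This is the Poisson-binomial tail,
    computed by dynamic programming over the disks.
    """
    # dist[j] = probability that exactly j of the disks seen so far fail
    dist = [1.0]
    for p in p_fail:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this disk survives
            nxt[j + 1] += q * p       # this disk fails
        dist = nxt
    return sum(dist[k:])


# Hypothetical 14-disk RAID-6 group: no single disk looks alarming,
# but the group-level risk of a triple failure can still be examined.
group = [0.2] * 14
triple_failure_risk = prob_at_least_k_failures(group, 3)
```

A group-level score of this kind can rank RAID groups for attention even when every per-disk probability sits below an individual replacement threshold, which is the situation the abstract describes.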

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/2820615

Reference: 49 articles.

1. Monitoring hard disks with S.M.A.R.T;Allen Bruce;Linux Journal,2004

2. Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering

Cited by 55 articles.

1. Locally Repairable Convertible Codes With Optimal Access Costs;IEEE Transactions on Information Theory;2024-09

2. Examining the impact of critical attributes on hard drive failure times: Multi‐state models for left‐truncated and right‐censored semi‐competing risks data;Applied Stochastic Models in Business and Industry;2023-12-03

3. Self-optimised cost-sensitive classifiers for early field failure prediction in storage systems;Swarm and Evolutionary Computation;2023-12

4. From Missteps to Milestones: A Journey to Practical Fail-Slow Detection;ACM Transactions on Storage;2023-11

5. Tunable Sparing of Disks in a Cloud Data Center;2023 7th International Conference on Computer Applications in Electrical Engineering-Recent Advances (CERA);2023-10-27