Affiliation:
1. EMC Corporation, Santa Clara, CA
2. EMC Corporation and University of Texas at Arlington, Santa Clara, CA
3. Datrium, Inc., Santa Clara, CA
4. Datrium, Inc.
Abstract
Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from six disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures.
With these findings we designed RAIDShield, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures. Second, we have designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures, which can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
We conclude with discussions of operational considerations in deploying RAIDShield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.
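The two components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the reallocated-sector threshold, and the per-disk failure probabilities are hypothetical, and the joint probability is computed under the simplifying assumption that disks fail independently, which the abstract does not state.

```python
from typing import List, Sequence


def disks_to_replace(reallocated: Sequence[int], threshold: int) -> List[int]:
    """Per-disk check: indices of disks whose reallocated-sector count
    meets or exceeds a (hypothetical) replacement threshold."""
    return [i for i, count in enumerate(reallocated) if count >= threshold]


def prob_at_least_k_failures(p: Sequence[float], k: int) -> float:
    """Group-level check: probability that at least k disks in a group fail,
    given per-disk failure probabilities p.

    Assumption (not from the source): failures are independent.
    """
    # dist[j] = probability that exactly j of the disks seen so far fail
    dist = [1.0]
    for p_i in p:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p_i)   # this disk survives
            nxt[j + 1] += q * p_i       # this disk fails
        dist = nxt
    return sum(dist[k:])


# Hypothetical RAID-6 group: no single disk crosses the threshold,
# yet the group-level risk of three or more failures can still be scored.
reallocated_counts = [40, 55, 60, 10, 70, 65]
flagged = disks_to_replace(reallocated_counts, threshold=100)
group_risk = prob_at_least_k_failures([0.05, 0.08, 0.09, 0.01, 0.10, 0.10], k=3)
```

RAID-6 tolerates two concurrent disk failures, so `k=3` corresponds to the triple-error case the abstract discusses. A group whose score is high even though `flagged` is empty is the situation the second component is meant to catch.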
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture
Cited by
55 articles.