Online detection of failures generated by storage simulator
-
Published:2021-01-01
Issue:1
Volume:1740
Page:012052
-
ISSN:1742-6588
-
Container-title:Journal of Physics: Conference Series
-
Short-container-title:J. Phys.: Conf. Ser.
Author:
Arzymatov Kenenbek, Hushchyn Mikhail, Sapronov Andrey, Belavin Vladislav, Gremyachikh Leonid, Karpov Maksim, Ustyuzhanin Andrey
Abstract
Modern large-scale data farms consist of hundreds of thousands of storage devices that span distributed infrastructure. Devices used in modern data centers (such as controllers, links, SSDs, and HDDs) can fail due to hardware as well as software problems. Such failures or anomalies can be detected by monitoring the activity of components using machine learning techniques. To use these techniques, researchers need plenty of historical data from devices in both normal and failure modes for training algorithms. In this work, we address two problems: 1) the lack of storage data for the methods above, by creating a simulator, and 2) the application of existing online algorithms that can more quickly detect a failure that has occurred in one of the components.
We created a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The package's flexible structure allows us to create a model of a real-world storage system with a configurable number of components. The primary area of interest is exploring the storage machine's behavior under stress testing or in medium- or long-term operation, in order to observe failures of its components.
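To make the discrete-event paradigm concrete, here is a minimal sketch in Python. It is not the authors' Go package and does not reproduce its API. The function `simulate`, its parameters, and the failure model (disk 0 serving requests `slowdown` times slower after `fail_time`) are all hypothetical choices made for illustration. Events are kept in a time-ordered priority queue, and the simulation clock jumps from one event to the next.

```python
import heapq
import random


def simulate(num_disks=3, horizon=1000.0, mean_interarrival=1.0,
             mean_service=0.5, fail_time=500.0, slowdown=5.0, seed=0):
    """Toy discrete-event model of a storage system (illustrative assumptions only).

    Requests arrive as a Poisson stream and are sent to a random disk.
    Requests that arrive at disk 0 at or after `fail_time` are served
    `slowdown` times slower, which stands in for a degraded component.
    Returns a list of (completion_time, disk, latency) tuples.
    """
    rng = random.Random(seed)
    events = []                      # min-heap ordered by event time
    seq = 0                          # tie-breaker for events with equal times
    busy_until = [0.0] * num_disks   # time at which each disk becomes free
    completions = []

    def schedule(time, kind, disk, arrived):
        nonlocal seq
        heapq.heappush(events, (time, seq, kind, disk, arrived))
        seq += 1

    schedule(rng.expovariate(1.0 / mean_interarrival), "arrival", -1, 0.0)
    while events:
        now, _, kind, disk, arrived = heapq.heappop(events)
        if now > horizon:
            break                    # every remaining event is later still
        if kind == "arrival":
            disk = rng.randrange(num_disks)
            service = rng.expovariate(1.0 / mean_service)
            if disk == 0 and now >= fail_time:
                service *= slowdown  # hypothetical failure mode
            start = max(now, busy_until[disk])   # wait if the disk is busy
            busy_until[disk] = start + service
            schedule(busy_until[disk], "done", disk, now)
            schedule(now + rng.expovariate(1.0 / mean_interarrival),
                     "arrival", -1, 0.0)
        else:
            completions.append((now, disk, now - arrived))
    return completions
```

The per-disk latencies returned by `simulate()` form the kind of time series in which a failure appears as a change in distribution.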
To discover failures in the time series generated by the simulator, we modified a change-point detection algorithm that works in online mode. The goal of change-point detection is to discover differences in the distribution of a time series. This work describes an approach for failure detection in time series data based on direct density ratio estimation via binary classifiers.
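The idea of estimating a density ratio with a binary classifier can be sketched as follows. This is an illustration under stated assumptions, not the authors' modified algorithm. The names `change_score` and `online_scores`, the logistic-regression classifier, the quadratic features, the window sizes, and the symmetric score are all hypothetical choices. A classifier f is trained to separate a reference window (label 0) from the most recent test window (label 1). The ratio p_test(x) / p_ref(x) is then approximated by f(x) / (1 - f(x)) multiplied by n_ref / n_test. When the two windows share a distribution, the classifier cannot separate them and the score stays near zero.

```python
import math
import random


def _sigmoid(v):
    # Numerically safe logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def change_score(reference, test, steps=200, lr=0.5, l2=0.01):
    """Dissimilarity of two windows from a classifier-based density ratio."""
    pooled = list(reference) + list(test)
    mean = sum(pooled) / len(pooled)
    std = math.sqrt(sum((x - mean) ** 2 for x in pooled) / len(pooled)) or 1.0

    def features(x):
        z = (x - mean) / std
        return (1.0, z, z * z)       # bias, linear and quadratic terms

    data = ([(features(x), 0.0) for x in reference]
            + [(features(x), 1.0) for x in test])
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):           # plain gradient descent on the log-loss
        grad = [0.0, 0.0, 0.0]
        for phi, y in data:
            err = _sigmoid(sum(wi * pi for wi, pi in zip(w, phi))) - y
            for j in range(3):
                grad[j] += err * phi[j]
        w = [wi - lr * (g / len(data) + l2 * wi) for wi, g in zip(w, grad)]

    prior = math.log(len(reference) / len(test))

    def log_ratio(x):
        # log of p_test(x) / p_ref(x), estimated from the classifier logit
        return sum(wi * pi for wi, pi in zip(w, features(x))) + prior

    # Symmetric, KL-like score: large when the windows are easy to separate.
    return (sum(log_ratio(x) for x in test) / len(test)
            - sum(log_ratio(x) for x in reference) / len(reference))


def online_scores(series, window=50, step=25):
    """Slide two adjacent windows over the series; t is their boundary index."""
    return [(t, change_score(series[t - window:t], series[t:t + window]))
            for t in range(window, len(series) - window + 1, step)]
```

In an online setting, the score would be recomputed as new observations arrive, and a change would be reported when it exceeds a threshold. How to choose that threshold is left open here.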
Subject
General Physics and Astronomy