Optimizing Efficiency of Machine Learning Based Hard Disk Failure Prediction by Two-Layer Classification-Based Feature Selection
Published: 2023-06-26
Issue: 13
Volume: 13
Page: 7544
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Wang Han 1, Zhuge Qingfeng 1, Sha Edwin Hsing-Mean 1, Xu Rui 1, Song Yuhong 1
Affiliation:
1. School of Computer Science and Technology, East China Normal University, Shanghai 200063, China
Abstract
Predicting hard disk failure effectively and efficiently can prevent the high costs of data loss in data storage systems. Disk failure prediction based on machine learning (ML) and artificial intelligence (AI) has gained notable attention because of its strong predictive capability. Improving the accuracy and performance of disk failure prediction, however, remains a challenging problem. When a disk failure is imminent, little time is left for the prediction process, which includes building models and making predictions. Faster training enables more timely model updates, while predictions that arrive too late are not only worthless but also waste resources. To improve both prediction quality and modeling timeliness, this paper proposes a two-layer classification-based feature selection scheme. An attribute filter that calculates the importance of each attribute was designed to remove attributes insensitive to failure identification, where importance is derived from classification tree models. Furthermore, an attribute classification method is proposed that groups features according to their correlation coefficients. In the experiments, ML/AI models were applied, including naïve Bayes, random forest, support vector machine, gradient boosted decision tree, convolutional neural networks, and long short-term memory. The results showed that the proposed technique improves the prediction accuracy of ML/AI-based hard disk failure prediction models; in particular, random forest and long short-term memory combined with the proposed technique achieved the best accuracy. Meanwhile, in the best case, the proposed scheme reduced training and prediction latency by 75% and 83%, respectively, compared with the baseline methods.
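As a rough illustration of the two-layer idea described in the abstract, the sketch below ranks SMART attributes with a classification-tree importance score and then prunes highly correlated attributes. The scikit-learn estimator, threshold values, and function name are assumptions made for this example and are not taken from the paper.

```python
# Minimal sketch of a two-layer feature-selection pass over SMART attributes,
# assuming scikit-learn and pandas. The estimator choice, thresholds, and
# function name are illustrative assumptions, not values from the paper.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def two_layer_feature_selection(X: pd.DataFrame, y: pd.Series,
                                importance_thresh: float = 0.01,
                                corr_thresh: float = 0.9) -> list:
    # Layer 1: score each attribute with a classification-tree importance
    # and drop attributes that are insensitive to failure identification.
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    importance = pd.Series(tree.feature_importances_, index=X.columns)
    ranked = importance[importance >= importance_thresh].sort_values(ascending=False)

    # Layer 2: walk the ranked attributes and keep only one representative
    # from each group of highly correlated (|Pearson r| >= corr_thresh) features.
    corr = X[ranked.index].corr().abs()
    selected = []
    for feat in ranked.index:
        if all(corr.loc[feat, kept] < corr_thresh for kept in selected):
            selected.append(feat)
    return selected


# Usage (illustrative): X holds SMART attribute columns, y the failure labels.
# selected = two_layer_feature_selection(X, y)
```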
Funder
NSFC; Shanghai Science and Technology Commission Project
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science
Cited by: 1 article