Affiliation:
1. University of Chicago
2. Pure Storage
3. Parallel Machines
4. NetApp
5. Huawei
6. Twitter
7. Nutanix
8. IBM
9. Los Alamos National Laboratory
10. Argonne National Laboratory
11. New Mexico Consortium
12. University of Utah
13. University of California, Santa Cruz
14. University of Chicago Research Computing Center
Abstract
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
Funder
DOE Office of Science User Facility
NSF
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture
Reference
43 articles.
1. 2011. NAND Flash Media Management Through RAIN. Micron.
Cited by
48 articles.
1. DRACO: Distributed Resource-aware Admission Control for large-scale, multi-tier systems;Journal of Parallel and Distributed Computing;2024-10
2. Understanding Silent Data Corruption in Processors for Mitigating its Effects;ACM Transactions on Architecture and Code Optimization;2024-09-02
3. Asymmetric RAID: Rethinking RAID for SSD Heterogeneity;Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems;2024-07-08
4. Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay;2024 IEEE Symposium on Security and Privacy (SP);2024-05-19
5. Detection Is Better Than Cure: A Cloud Incidents Perspective;Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering;2023-11-30