From Missteps to Milestones: A Journey to Practical Fail-Slow Detection

Author:

Lu Ruiming (1), Xu Erci (2), Zhang Yiming (3), Zhu Fengyi (4), Zhu Zhaosheng (4), Wang Mengtian (4), Zhu Zongpeng (4), Xue Guangtao (1), Shu Jiwu (3), Li Minglu (5), Wu Jiesheng (4)

Affiliation:

1. Shanghai Jiao Tong University, China

2. Alibaba Inc. and Shanghai Jiao Tong University, China

3. Xiamen University, China

4. Alibaba Inc., China

5. Shanghai Jiao Tong University and Zhejiang Normal University, China

Abstract

Emerging “fail-slow” failures plague both software and hardware: the victim components still function, but with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives, covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.

Funder

NSFC

Alibaba Innovation Research

National Key R&D Program of China

Program of Hunan Postdoc Innovation

Program of Shanghai Academic Research Leader

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture
