A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

Author:

Runzhou Han (1), Om Rameshwar Gatla (1), Mai Zheng (1), Jinrui Cao (2), Di Zhang (3), Dong Dai (3), Yong Chen (4), Jonathan Cook (5)

Affiliation:

1. Iowa State University, Ames, Iowa

2. State University of New York at Plattsburgh, Plattsburgh, New York

3. University of North Carolina at Charlotte, Charlotte, North Carolina

4. Texas Tech University, Lubbock, Texas

5. New Mexico State University, Las Cruces, New Mexico

Abstract

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematic examination of PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze in depth the root causes of the observed abnormal symptoms; this analysis has led to a new patch set being merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging.
We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.

Funder

NSF

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

References (120 articles; first entries shown):

1. Lustre File System. http://lustre.org/.

2. BeeGFS File System. https://www.beegfs.io/.

3. The OrangeFS Project. 2017. http://www.orangefs.org/.

4. RAIDShield

Cited by 7 articles.

1. Revisiting Erasure Codes: A Configuration Perspective;Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems;2024-07-08

2. PROV-IO: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems;IEEE Transactions on Parallel and Distributed Systems;2024-05

3. Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning;IEEE Transactions on Parallel and Distributed Systems;2024-04

4. Understanding Persistent-memory-related Issues in the Linux Kernel;ACM Transactions on Storage;2023-10-03

5. On the Reproducibility of Bugs in File-System Aware Storage Applications;2022 IEEE International Conference on Networking, Architecture and Storage (NAS);2022-10
