One Size Does Not Fit All: Clustering Supercomputer Failures Using a Multiple Time Window Approach

Author:

Di Martino, Catello

Publisher:

Springer Berlin Heidelberg

References: 24 articles.

1. Guermouche, A., Ropars, T., Snir, M., Cappello, F.: HydEE: Failure containment without event logging for large scale send-deterministic MPI applications. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), pp. 1216–1227 (May 2012)

2. Fu, S., Xu, C.: Exploring event correlation for failure prediction in coalitions of clusters. In: SC 2007: Proc. of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12. ACM (2007)

3. Gainaru, A., Cappello, F., Snir, M., Kramer, W.: Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 77:1–77:11. IEEE Computer Society Press, Los Alamitos (2012)

4. Di Martino, C., Cinque, M., Cotroneo, D.: Assessing time coalescence techniques for the analysis of supercomputer logs. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp. 1–12 (2012)

5. Buckley, M.F., Siewiorek, D.P.: A comparative analysis of event tupling schemes. In: FTCS 1996: Proc. of the Twenty-Sixth Annual Int. Symp. on Fault-Tolerant Computing (FTCS 1996), p. 294. IEEE Computer Society, Washington, DC (1996)

Cited by 10 articles.

1. Predicting faults in high performance computing systems;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2019-11-17

2. Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters;IEEE Transactions on Dependable and Secure Computing;2018-11-01

3. Desh;Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing;2018-06-11

4. Analysis and Diagnosis of SLA Violations in a Production SaaS Cloud;IEEE Transactions on Reliability;2017-03

5. Measuring the Resiliency of Extreme-Scale Computing Environments;Springer Series in Reliability Engineering;2016
