A Comprehensive Review on Power Efficient Fault Tolerance Models in High Performance Computation Systems

Author:

Shetty Nayana

Abstract

Several machines are being developed at exascale for high performance computation. These machines can perform at least one exaflop, that is, a billion billion (10^18) calculations per second. They can help address challenging computational problems and so improve our understanding of nature and the universe. However, these machines face certain obstacles. Because exascale machines contain a huge number of components, failures may be frequent and resilience may be challenging. Some form of fault tolerance must be incorporated in the system so that applications maintain a high progress rate. At the same time, power management must be built into the system. All layers, including the fault tolerance layer, must adhere to the system's power limit. Because of their high power consumption, installing exascale machines can be expected to bring large energy bills. The energy profile of the various fault tolerance models must therefore be analyzed. This paper evaluates three rollback-recovery fault tolerance models: checkpoint/restart, message logging, and parallel recovery. For executions with failures, parallel recovery provides the most energy-efficient solution across programs written with various programming models. Execution is also faster with parallel recovery than with the other techniques. An analytical model is used to explore these models and their behavior at extreme scales.

Publisher

Inventive Research Organization

Cited by 5 articles.

1. Improving Robustness of Two Speed Serial Parallel Booth Multiplier Using Fault Detection Mechanism;Lecture Notes in Electrical Engineering;2023

2. Advantages of Using IP Network Modeling Platforms in the Study of Power Electronic Devices;Lecture Notes in Electrical Engineering;2023

3. Auditory Machine Intelligence for Incipient Fault Localization and Classification in Transmission Lines;Proceedings of Third International Conference on Sustainable Expert Systems;2023

4. Effectiveness of Classification Techniques for Fault Bearing Prediction;2022 6th International Conference on Electronics, Communication and Aerospace Technology;2022-12-01

5. A Critical Review on the Low Power SIC MOSFET based Current Fed Inverter used in Surface Hardening Application;2022 3rd International Conference on Electronics and Sustainable Communication Systems (ICESC);2022-08-17

