Abstract
As computer systems grow in size and complexity, reducing the overhead of fault tolerance techniques has become important in recent years. One fault tolerance technique is checkpointing, which saves a snapshot of the information computed up to a specific moment; doing so suspends the execution of the application and consumes I/O resources and network bandwidth. Characterizing the files generated when checkpointing a parallel application is useful for determining the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I/O. In this paper, we present a model of checkpoint behavior for parallel applications that perform I/O; this behavior depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I/O used. These characteristics also influence scalability, the resources consumed and their impact on the I/O system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number of I/O aggregator processes, the buffer size used by the two-phase I/O optimization technique and the components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting which type of checkpoint configuration is most appropriate given the application's characteristics and the resources available. Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.
Funder
Agencia Estatal de Investigación
Universitat Autònoma de Barcelona
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture, Information Systems, Theoretical Computer Science, Software
Cited by
2 articles.