Abstract
In recent years, applications related to Artificial Intelligence and big data, among others, have evolved, and I/O operations need to improve to avoid bottlenecks when accessing ever larger amounts of data. For this purpose, the Expand Ad-Hoc parallel file system is being designed and developed. Since these applications have very long execution times, the file system needs fault tolerance mechanisms that allow them to continue running in the presence of failures. This work introduces a fault-tolerant design based on data replication for the Expand Ad-Hoc parallel file system, together with an initial evaluation conducted on the HPC4AI Laboratory supercomputer in Torino. The evaluation found that Expand Ad-Hoc with fault tolerance, despite the data replication, generally performs and scales better than other parallel file systems without fault tolerance.
Publisher
Springer Nature Switzerland