1. Analyzing the impact of system reliability events on applications in the Titan supercomputer;Ashraf,2018
2. Toward exascale resilience: 2014 update;Cappello;Supercomput. Front. Innov. Int. J.,2014
3. Study and analysis of the high performance computing failures in China meteorological field;Chen;J. Geosci. Environ. Prot.,2017
4. LogAider: a tool for mining potential correlations of HPC log events;Di,2017
5. Exploring properties and correlations of fatal events in a large-scale HPC system;Di;IEEE Trans. Parallel Distrib. Syst.,2019