1. Robust Scheduling for Large-Scale Distributed Systems;2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom);2020-12
2. Predictive Reliability and Fault Management in Exascale Systems;ACM Computing Surveys;2020-10-15
3. Automatic abnormal log detection by analyzing log history for providing debugging insight;Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice;2020-06-27
4. Characterizing Accuracy-Aware Resilience of GPGPU Applications;2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID);2020-05
5. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System;IEEE Transactions on Parallel and Distributed Systems;2019-02-01