Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder

Published:2024-09-13
ISSN:1049-331X
Container-title:ACM Transactions on Software Engineering and Methodology
language:en
Short-container-title:ACM Trans. Softw. Eng. Methodol.

Author:

Sun Yongqian¹, Lin Zihan¹, Shi Binpeng¹, Zhang Shenglin¹, Ma Shiyu¹, Jin Pengxiang², Zhong Zhenyu¹, Pan Lemeng³, Guo Yicheng³, Pei Dan⁴

Affiliation:

1. Nankai University, China

2. Alibaba (Beijing) Software Services Co., Ltd., China

3. AI Application Research Center, Huawei Technologies Co., China

4. Tsinghua University, China

Abstract

Accurate and efficient localization of root cause instances in large-scale microservice systems is of paramount importance. Unfortunately, prevailing methods face several limitations. Notably, some recent methods rely on supervised learning which necessitates a substantial amount of labeled data. However, labeling root cause instances is time-consuming and laborious, especially with multiple modalities of data including logs, traces, metrics, etc. Moreover, some approaches favor deep learning for localization but lack interpretability and continuous improvement mechanisms. To address the above challenges, we propose DeepHunt, a novel root cause localization method based on multimodal data analysis. Firstly, DeepHunt introduces Root Cause Score (RCS) by integrating reconstruction errors and failure propagation patterns (upstream-downstream relationships), imparting interpretability to the localization of root causes. Then, it embraces Graph Autoencoder (GAE) to address the limitation imposed by scarce labeled data. It employs data augmentation to mitigate the adverse effects of insufficient historical training samples. We evaluate DeepHunt on two open-source datasets, and it outperforms existing methods when facing a zero-label cold start. DeepHunt can be further improved by continuously fine-tuning through a feedback mechanism.
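
The abstract does not give the RCS formula, so the following is only a minimal sketch of the general idea of combining per-instance reconstruction errors with upstream-downstream relationships. It is not DeepHunt's actual definition. The function name `root_cause_scores`, the weight `alpha`, the discounting rule, and the example services are all hypothetical. The assumption is that each instance already has a reconstruction error (for example from an autoencoder) and that an instance whose callee is also highly anomalous has probably inherited part of its anomaly.

```python
from typing import Dict, Iterable, Tuple


def root_cause_scores(
    recon_error: Dict[str, float],
    call_edges: Iterable[Tuple[str, str]],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Toy score (hypothetical, NOT the paper's RCS formula).

    recon_error: reconstruction error per instance (higher = more anomalous).
    call_edges:  (caller, callee) pairs; every name must appear in recon_error.
    alpha:       assumed weight for discounting anomalies inherited from callees.
    """
    # Largest reconstruction error among the instances each node calls.
    worst_callee = {node: 0.0 for node in recon_error}
    for caller, callee in call_edges:
        worst_callee[caller] = max(worst_callee[caller], recon_error[callee])

    # Own error, discounted when a downstream callee is itself anomalous.
    return {
        node: err - alpha * worst_callee[node]
        for node, err in recon_error.items()
    }


if __name__ == "__main__":
    errors = {"frontend": 0.6, "cart": 0.8, "db": 0.9}  # made-up values
    edges = [("frontend", "cart"), ("cart", "db")]       # frontend -> cart -> db
    scores = root_cause_scores(errors, edges)
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ranking)  # ['db', 'cart', 'frontend']
```

In this toy call chain every instance looks anomalous, but `db` has no anomalous callee to blame, so it ranks first. Each score can be traced back to an instance's own error and its callees' errors, which is the kind of interpretability the abstract attributes to RCS.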

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3695999
