BALANCE: Bayesian Linear Attribution for Root Cause Localization

Author:

Chen Chaoyu (1), Yu Hang (1), Lei Zhichao (1), Li Jianguo (1), Ren Shaokang (1), Zhang Tingkai (1), Hu Silin (1), Wang Jianchao (1), Shi Wenhui (2)

Affiliation:

1. Ant Group, Hangzhou, China

2. OceanBase, Beijing, China

Abstract

Root Cause Analysis (RCA) plays an indispensable role in the maintenance and operation of distributed data systems, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper explores the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three components. First, we propose a Bayesian multicollinear feature selection (BMFS) model that predicts the target KPIs from the candidate root causes in a forward manner, promoting sparsity while accounting for the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score of each candidate in a backward manner. Third, if there are multiple KPIs, we merge the estimated root causes related to each KPI. We extensively evaluate BALANCE on one synthetic dataset as well as three real-world RCA tasks, namely bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in accuracy while requiring the least running time, and achieves at least 6% higher accuracy than SOTA methods on the real-world tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further support its use for real-time diagnosis in distributed data systems.
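The three-stage flow described in the abstract (forward sparse prediction, backward attribution, merging across KPIs) can be illustrated with a minimal sketch. It is not the authors' method: an ordinary Lasso fitted by coordinate descent stands in for the BMFS model, the merging rule (averaging normalized absolute scores) is an assumption, and all function names and the synthetic data are hypothetical. The attribution step uses the fact that, for a linear model, the contribution of candidate j to the change in prediction relative to a baseline is exactly w_j * (x_j - baseline_j).

```python
import random

def fit_sparse_linear(X, y, lam=0.1, iters=100):
    """Forward step (stand-in for BMFS): Lasso via coordinate descent.
    Minimizes (1/2n)*||y - Xw||^2 + lam*||w||_1 on centered data.
    Returns the weights and the per-candidate means used as baseline."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    resid = [v - y_mean for v in y]  # residual for w = 0
    w = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            col = cols[j]
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid))
            thr = lam * n
            # Soft-thresholding drives weak candidates to exactly zero.
            if rho > thr:
                new_w = (rho - thr) / norm
            elif rho < -thr:
                new_w = (rho + thr) / norm
            else:
                new_w = 0.0
            delta = new_w - w[j]
            if delta:
                resid = [r - delta * c for r, c in zip(resid, col)]
                w[j] = new_w
    return w, x_mean

def attribute(w, baseline, x_anom):
    """Backward step: linear attribution w_j * (x_j - baseline_j)."""
    return [wj * (xj - bj) for wj, xj, bj in zip(w, x_anom, baseline)]

def merge(attr_per_kpi):
    """Merge step (assumed rule): average the normalized absolute scores."""
    p = len(attr_per_kpi[0])
    merged = [0.0] * p
    for attr in attr_per_kpi:
        total = sum(abs(a) for a in attr) or 1.0
        for j, a in enumerate(attr):
            merged[j] += abs(a) / total
    return [m / len(attr_per_kpi) for m in merged]

# Hypothetical data: three candidate root causes, two target KPIs.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
kpi_a = [2.0 * r[0] + 0.1 * rng.gauss(0, 1) for r in X]
kpi_b = [-1.0 * r[0] + 0.5 * r[2] + 0.1 * rng.gauss(0, 1) for r in X]

x_anom = [3.0, 0.2, -0.1]  # candidate 0 deviates strongly during the anomaly
attrs = []
for kpi in (kpi_a, kpi_b):
    w, baseline = fit_sparse_linear(X, kpi)
    attrs.append(attribute(w, baseline, x_anom))

merged_scores = merge(attrs)
ranking = sorted(range(3), key=lambda j: -merged_scores[j])
print(ranking, merged_scores)
```

In this toy setup candidate 0 drives both KPIs and is the only one that deviates strongly, so it should receive the highest merged score. The L1 penalty plays the role of the sparsity prior mentioned in the abstract, but unlike BMFS it does not explicitly model correlation between candidates.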

Publisher

Association for Computing Machinery (ACM)

