Affiliation:
1. Peking University, Beijing, China
2. ByteDance Inc., Beijing, China
Abstract
This study presents the salient facts and challenges of host failure operations in hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics covering a broad range of aspects. The faulting mechanism inside the host connects these heterogeneous metrics through direct and indirect correlations, making it extremely difficult to sort out the propagation procedures and the root cause from these intertwined indicators. To understand the failure mechanism inside the host in depth, we develop HEAL, a novel host metrics analysis toolkit. HEAL discovers dynamic causality in sparse, heterogeneous host metrics by combining the strengths of time series analysis and random variable analysis. It also proactively extracts hints about causal direction from the asymmetry of causality and from historical knowledge. Together, these techniques help HEAL produce accurate results even from undesirable inputs. Extensive experiments in our production environment verify that HEAL provides significantly better accuracy and full-process interpretability than the SOTA baselines. With these advantages, HEAL serves our data center and worldwide product operations and contributes to many other workflows.
Funder
Qiyuan Lab Innovation Fund
ByteDance University Research Project
National Natural Science Foundation of China
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications; Hardware and Architecture; Safety, Risk, Reliability and Quality; Computer Science (miscellaneous)