Capturing, indexing, clustering, and retrieving system history

Published:2005-10-20 Issue:5 Volume:39 Page:105-118
ISSN:0163-5980
Container-title:ACM SIGOPS Operating Systems Review
language:en
Short-container-title:SIGOPS Oper. Syst. Rev.

Author:

Cohen Ira¹, Zhang Steve², Goldszmidt Moises¹, Symons Julie¹, Kelly Terence¹, Fox Armando¹

Affiliation:

1. Hewlett-Packard Laboratories, Palo Alto, CA

2. Stanford University, Palo Alto, CA

Abstract

We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) as well as a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24 x 7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/1095809.1095821

References: 22 articles.

1. Performance debugging for distributed systems of black boxes

Cited by 16 articles.

1. PerfSig;Proceedings of the 44th International Conference on Software Engineering;2022-05-21

2. Forecasting of Computer Network Anomalous States Based on Sequential Pattern Analysis of “Historical Data”;Automatic Control and Computer Sciences;2021-11

3. Using black-box performance models to detect performance regressions under varying workloads: an empirical study;Empirical Software Engineering;2020-08-28

4. DeepTriage;Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining;2020-07-06

5. Cyber anomaly detection: Using tabulated vectors and embedded analytics for efficient data mining;Journal of Algorithms & Computational Technology;2018-08-06