SherLog

Published:2010-03-05 Issue:1 Volume:38 Page:143-154
ISSN:0163-5964
Container-title:ACM SIGARCH Computer Architecture News
language:en
Short-container-title:SIGARCH Comput. Archit. News

Author:

Yuan Ding¹,Mai Haohui¹,Xiong Weiwei¹,Tan Lin²,Zhou Yuanyuan³,Pasupathy Shankar⁴

Affiliation:

1. University of Illinois at Urbana-Champaign, Urbana, IL, USA

2. University of Waterloo, Waterloo, ON, Canada

3. University of California San Diego, San Diego, CA, USA

4. NetApp, Inc, Sunnyvale, CA, USA

Abstract

Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production-run failures is an important but challenging task because they are difficult to reproduce in house, for several reasons: (1) unavailability of users' inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors. Therefore, programmers often have to diagnose a production-run failure based on logs collected from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log's semantics. It infers both control and data value information regarding the failed execution. We evaluate SherLog with 8 representative real-world software failures (6 software bugs and 2 configuration errors) from 7 applications, including 3 servers. The information inferred by SherLog is very useful for programmers diagnosing these evaluated failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within only 40 minutes.
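
The "must or may have happened" distinction in the abstract can be illustrated with a toy sketch. This is not SherLog's actual analysis, and the abstract does not describe its internals. The sketch simply enumerates every path through a small, hypothetical control-flow graph, keeps the paths whose emitted log messages match the observed log, and reports which nodes lie on all matching paths (must) and which lie on only some (may). The graph, node names, and log messages are all invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical acyclic control-flow graph: node -> successor nodes.
CFG: Dict[str, List[str]] = {
    "entry": ["check_cfg"],
    "check_cfg": ["load_default", "load_user"],
    "load_default": ["open_file"],
    "load_user": ["open_file"],
    "open_file": ["fail", "ok"],
    "fail": [],
    "ok": [],
}

# Hypothetical log message printed when a node runs; other nodes are silent.
LOGS: Dict[str, str] = {
    "entry": "starting",
    "fail": "cannot open file",
    "ok": "done",
}


def all_paths(cfg: Dict[str, List[str]], node: str) -> List[List[str]]:
    """Enumerate every path from `node` to a node with no successors."""
    succs = cfg[node]
    if not succs:
        return [[node]]
    return [[node] + rest for s in succs for rest in all_paths(cfg, s)]


def infer(
    cfg: Dict[str, List[str]],
    logs: Dict[str, str],
    entry: str,
    observed: List[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (must, may) node sets consistent with the observed log."""
    consistent = [
        path
        for path in all_paths(cfg, entry)
        if [logs[n] for n in path if n in logs] == observed
    ]
    if not consistent:
        return set(), set()
    node_sets = [set(p) for p in consistent]
    must = set.intersection(*node_sets)   # on every consistent path
    may = set.union(*node_sets) - must    # on some, but not all
    return must, may


must, may = infer(CFG, LOGS, "entry", ["starting", "cannot open file"])
print("must:", sorted(must))
print("may: ", sorted(may))
```

With the observed log `["starting", "cannot open file"]`, both configuration-loading branches remain possible because neither prints anything, so they land in the may set, while the nodes common to every matching path land in the must set. Brute-force path enumeration like this would not scale to real programs; it only makes the two notions concrete.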

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/1735970.1736038

References: 50 articles.

1. H. Agrawal, R. A. DeMillo, and E. H. Spafford. Debugging with dynamic slicing and backtracking. Software -- Practice and Experience, 23(6):589--616, June 1993. 10.1002/spe.4380230603

2. Performance debugging for distributed systems of black boxes

3. A. Aiken, S. Bugrara, I. Dillig, T. Dillig, P. Hawkins, and B. Hackett. The Saturn Program Analysis System.

Cited by 26 articles.

1. QoS-Aware Co-Scheduling for Distributed Long-Running Applications on Shared Clusters;IEEE Transactions on Parallel and Distributed Systems;2022-12-01

2. A Survey of AIOps Methods for Failure Management;ACM Transactions on Intelligent Systems and Technology;2021-12-31

3. LogFlash: Real-time Streaming Anomaly Detection and Diagnosis from System Logs for Large-scale Software Systems;2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE);2021-10

4. sBiLSAN: Stacked Bidirectional Self-attention LSTM Network for Anomaly Detection and Diagnosis from System Logs;Lecture Notes in Networks and Systems;2021-08-07

5. Identifying Anomaly Detection Patterns from Log Files: A Dynamic Approach;Computational Science and Its Applications – ICCSA 2021;2021