Affiliation:
1. University of Wisconsin, Madison, WI, USA
Abstract
Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once.
This paper designs a low-overhead, low-latency, privacy-preserving production-run failure diagnosis system based on two observations. First, short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, the Last Branch Record (LBR), which records the last few taken branches and can help diagnose sequential bugs. We then propose a simple hardware extension, the Last Cache-coherence Record (LCR), which records the last few cache accesses with specified coherence states and hence helps diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR.
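The short-term memory idea can be illustrated in software with a fixed-capacity buffer that keeps only the most recent events and silently discards older ones. The sketch below is an illustration only, not the hardware mechanism or the paper's implementation: the class name `LastRecordBuffer`, its methods, and the synthetic branch addresses are all hypothetical, and the capacity of 16 simply mirrors the number of record entries used in the evaluation.

```python
from collections import deque


class LastRecordBuffer:
    """Hypothetical software model of a 'last N events' record.

    Only the most recent `capacity` entries are retained, so the cost of
    recording stays constant no matter how long the program runs.
    """

    def __init__(self, capacity=16):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self._entries = deque(maxlen=capacity)

    def record(self, source, target):
        """Record one taken branch as a (source, target) address pair."""
        self._entries.append((source, target))

    def snapshot(self):
        """Return the retained entries, oldest first, e.g. at a failure point."""
        return list(self._entries)


# Simulate 100 taken branches with made-up addresses; only the last 16 survive.
buf = LastRecordBuffer(capacity=16)
for i in range(100):
    buf.record(source=0x1000 + i, target=0x2000 + i)

history = buf.snapshot()
```

In this model, a diagnosis tool would inspect `history` at the moment of failure. If the root cause lies within the last few recorded events, which is the short-propagation-distance assumption stated above, the snapshot is enough to locate it.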
Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software projects. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes of 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Who Watches the Watchmen;ACM Computing Surveys;2019-07-31