Affiliation:
1. Computer Science Division, Electrical Engineering and Computer Sciences, University of California, Berkeley, CA
Abstract
In the Sprite environment, tolerating faults means recovering from them quickly. Our position is that performance and availability are the desired features of the typical locally-distributed office/engineering environment, and that very fast server recovery is the most cost-effective way of providing such availability. Mechanisms used for reliability can be inappropriate in systems with the primary goal of performance, and some availability-oriented methods using replicated hardware or processes cost too much for these systems. In contrast, availability via fast recovery need not slow down a system, and our experience in Sprite shows that in some cases the same techniques that provide high performance also provide fast recovery. In our first attempt to reduce file server recovery times to less than 90 seconds, we take advantage of the distributed state already present in our file system, and a high-performance log-structured file system currently under implementation. As a long-term goal, we hope to reduce recovery to 10 seconds or less.
Publisher
Association for Computing Machinery (ACM)
References
13 articles.
1. Recovery management in QuickSilver
2. The Sprite network operating system
3. Roger Haskin. Personal Communication. September 30, 1990.
4. Fault-tolerant computing based on Mach
Cited by
3 articles.
1. The RAMCloud Storage System;ACM Transactions on Computer Systems;2015-09-11
2. Large-scale cluster management at Google with Borg;Proceedings of the Tenth European Conference on Computer Systems;2015-04-17
3. Recovery in the Calypso file system;ACM Transactions on Computer Systems;1996-08