Affiliation:
1. Technion, Haifa, Israel
2. IBM Research, Haifa, Israel
Abstract
One of the adverse effects of shrinking transistor sizes is that processors have become increasingly prone to hardware faults. At the same time, the number of cores per die rises. Consequently, core failures can no longer be ruled out, and future operating systems for many-core machines will have to incorporate fault tolerance mechanisms.
We present CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems. Our approach overcomes surprise removal of faulty cores, and also tolerates cascading core failures. When a core fails in user mode, CSR terminates the process executing on that core and migrates the remaining processes in its run-queue to other cores. We further show how hardware transactional memory may be used to overcome failures in critical kernel code. Our solution is scalable, incurs low overhead, and is designed to integrate into modern operating systems. We have implemented it in the Linux kernel, using Haswell's Transactional Synchronization Extensions (TSX), and tested it on a real system.
Funder
Intel Collaborative Research Institute for Computational Intelligence
Technion Funds for Security Research
Hasso-Plattner Institute
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software