Abstract
Uninterrupted uptime is a critical aspect of Virtual Machines (VMs) offered by cloud hosting providers. Google's VMs run on top of rapidly changing infrastructure: we regularly update hardware and host software, and we must quickly respond to failing hardware. Frequent change is critical to both development velocity (deploying new versions of services and infrastructure) and the ability to respond rapidly to defects, including critical security fixes. Typically these updates would be disruptive, resulting in VM termination or restart. In this paper we present how we use VM live migration at scale to eliminate this disruption with minimal impact to the guest, performing over 1,000,000 migrations monthly in our production fleet, with a median blackout of 50 ms and a 99th-percentile blackout of 300 ms.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
12 articles.
1. Custom Page Fault Handling With eBPF;Proceedings of the SIGCOMM Workshop on eBPF and Kernel Extensions;2024-08-04
2. Cloud-Native Computing: A Survey From the Perspective of Services;Proceedings of the IEEE;2024-01
3. A Taxonomy of Live Migration Management in Cloud Computing;ACM Computing Surveys;2023-10-05
4. MC-ELMM: Multi-Chip Endurance-Limited Memory Management;Proceedings of the International Symposium on Memory Systems;2023-10-02
5. BalCon — resource balancing algorithm for VM consolidation;Future Generation Computer Systems;2023-10