Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing

Published: 2023-11-18
ISSN:0734-2071
Container-title:ACM Transactions on Computer Systems
language:en
Short-container-title:ACM Trans. Comput. Syst.

Author:

Zhao Laiping¹, Cui Yushuai¹, Yang Yanan¹, Zhou Xiaobo¹, Qiu Tie¹, Li Keqiu¹, Bao Yungang²

Affiliation:

1. College of Intelligence and Computing, Tianjin University, Tianjin Key Lab. of Advanced Networking, China

2. Inst. of Computing Technology, CAS, China

Abstract

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput. We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
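
The multi-level reclamation queue mentioned in the abstract can be pictured with a minimal sketch. This is not Rhythm's implementation: the `BEJob` fields, the loss thresholds, and the "cheapest to preempt first" victim order are all assumptions made for illustration, since the abstract does not specify how preemption loss is measured or how levels are drawn.

```python
from dataclasses import dataclass


@dataclass
class BEJob:
    name: str
    cores: int               # cores currently held by the job (hypothetical unit)
    preemption_loss: float   # hypothetical score: work lost if preempted now


def build_levels(jobs, thresholds):
    """Bucket jobs into len(thresholds) + 1 levels by preemption loss.

    Level 0 holds the jobs that are cheapest to preempt. The thresholds
    are assumed to be sorted in ascending order.
    """
    levels = [[] for _ in range(len(thresholds) + 1)]
    for job in jobs:
        # A job's level is the number of thresholds its loss exceeds.
        level = sum(job.preemption_loss > t for t in thresholds)
        levels[level].append(job)
    return levels


def reclaim(levels, cores_needed):
    """Pick victims level by level until enough cores are freed.

    Returns every job if the request cannot be fully satisfied.
    """
    victims = []
    freed = 0
    for level in levels:
        for job in sorted(level, key=lambda j: j.preemption_loss):
            if freed >= cores_needed:
                return victims
            victims.append(job)
            freed += job.cores
    return victims


jobs = [BEJob("a", 2, 0.9), BEJob("b", 2, 0.1), BEJob("c", 4, 0.5)]
levels = build_levels(jobs, thresholds=[0.3, 0.7])
victims = reclaim(levels, cores_needed=5)
print([job.name for job in victims])
```

In this sketch a request for 5 cores preempts the low-loss jobs first and leaves the job with the highest preemption loss running, which is the intuition behind assigning different reclamation priorities to BE jobs.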

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3630006

Reference: 106 articles.

1. 2020. https://parsec.cs.princeton.edu/.

2. 2020. Scimark: A benchmark for scientific and numerical computing. https://openbenchmarking.org/test/pts/scimark2-1.3.2.

3. 2020. The SPEC Cloud IaaS 2018 benchmark is SPEC’s second benchmark suite to measure cloud performance. https://www.spec.org/.

4. 2020. Tensorflow-Bench: A benchmark framework for TensorFlow. https://github.com/tensorflow/benchmarks.

5. S. Agarwala, F. Alegre, K. Schwan, and J. Mehalingham. 2007. E2EProf: Automated End-to-End Performance Management for Enterprise Systems. In The 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07). 749–758.