Affiliation:
1. Microsoft, Redmond, USA
2. Microsoft, Mountain View, USA
Abstract
The dynamic nature of resource allocation and runtime conditions in the cloud can cause a job's runtime to vary widely across multiple runs, leading to a poor experience. Identifying the sources of this variation, and being able to predict and adjust for them, is crucial for cloud service providers that need to design reliable data processing pipelines, provision and allocate resources, adjust pricing, meet SLOs, and debug performance hazards.
In this paper, we analyze the runtime variation of millions of production Scope jobs on Cosmos, an exabyte-scale internal analytics platform at Microsoft. We propose an innovative two-step approach to predicting a job's runtime distribution: we characterize the typical distribution shapes and combine them with a classification model that achieves an average accuracy of >96%. The approach uses an interpretable machine-learning algorithm, outperforms traditional regression models, and better captures long tails. We examine factors such as job plan characteristics and inputs, resource allocation, physical cluster heterogeneity and utilization, and scheduling policies.
To the best of our knowledge, this is the first study on predicting categories of runtime distributions for enterprise analytics workloads at scale. Furthermore, we examine how our methods can be used to analyze what-if scenarios, focusing on the impact of resource allocation, scheduling, and physical cluster provisioning decisions on a job's runtime consistency and predictability.
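The two-step idea in the abstract (characterize typical distribution shapes, then classify jobs into them) can be illustrated with a minimal sketch. This is not the authors' algorithm: k-means and a nearest-centroid classifier are simple stand-ins chosen for brevity, and the bin edges, feature names, and sample data are all hypothetical.

```python
import random
import statistics

# Hypothetical bin edges on the ratio runtime / median runtime.
BINS = [0.5, 0.8, 1.2, 2.0, 5.0]

def shape_vector(runtimes):
    """Step 1a: describe a job's runtime distribution as a normalized histogram."""
    med = statistics.median(runtimes)
    counts = [0] * (len(BINS) + 1)
    for r in runtimes:
        counts[sum(r / med > edge for edge in BINS)] += 1
    return [c / len(runtimes) for c in counts]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))

def kmeans(vectors, k, iters=20, seed=0):
    """Step 1b: group shape vectors into k typical shapes (stand-in clustering)."""
    centroids = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [
            [statistics.fmean(col) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def fit_classifier(features, labels):
    """Step 2: nearest-centroid classifier from job features to shape label."""
    by_label = {}
    for f, y in zip(features, labels):
        by_label.setdefault(y, []).append(f)
    return {y: [statistics.fmean(col) for col in zip(*fs)]
            for y, fs in by_label.items()}

def predict(model, f):
    return min(model, key=lambda y: sq_dist(f, model[y]))

# Made-up historical runtimes (seconds) for four recurring jobs.
history = [
    [100, 101, 99, 102, 98, 100, 100, 101],   # tight
    [50, 51, 49, 50, 52, 50, 48, 50],         # tight
    [100, 100, 105, 95, 100, 300, 700, 100],  # long tail
    [10, 10, 11, 9, 10, 10, 40, 90],          # long tail
]
# Hypothetical, unscaled job features: [input_gb, spare_token_fraction].
features = [[10, 0.0], [12, 0.1], [500, 0.8], [450, 0.9]]

shapes = [shape_vector(h) for h in history]
centroids = kmeans(shapes, k=2)
labels = [nearest(s, centroids) for s in shapes]
model = fit_classifier(features, labels)
```

Calling `predict(model, [480, 0.85])` returns the shape label of the long-tail jobs in this toy data. Predicting a shape category rather than a single runtime is what lets such a scheme represent long tails, which a point estimate from a regression model cannot express.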
Publisher
Association for Computing Machinery (ACM)