Affiliation:
1. Center for Computing in Engineering & Sciences University of Campinas (Unicamp) Sao Paulo Brazil
2. Institute of Computing University of Campinas (Unicamp) Sao Paulo Brazil
3. Bachelor's Course in Computer Science Federal University of Technology of Parana (UTFPR) Santa Helena Parana Brazil
Abstract
SummaryCloud computing has become popular for training deep‐learning (DL) models, avoiding the costs of acquiring and maintaining on‐premise systems. SageMaker is a cloud service that automates the execution of DL workloads. Its features include automatic hyperparameter optimization and use of spot instances. Nonetheless, it does not assist in selecting the right instance type for a workload. In public clouds, rent price depends on the configuration of the chosen instance type. Advanced and faster instances are typically more expensive, but not always the best choice. To select the optimal instance type, users must compare the workload's relative performance (and hence cost) on several candidates. Building on the execution profiles of multiple DL applications, we model the performance and cost of training DL applications on SageMaker and propose a lightweight technique to estimate these at low temporal and monetary cost. This method is a performance proxy that can be used to replace more expensive performance measurement procedures. So, it could speed up any technique that relies on such measurements. We show how it can help cloud customers seeking suitable instance types to train DL models, and that it can accurately predict the performance of different instance types when training these models on SageMaker.
Funder
Petrobras
Fundação de Amparo à Pesquisa do Estado de São Paulo
Conselho Nacional de Desenvolvimento Científico e Tecnológico