A lightweight performance proxy for deep‐learning model training on Amazon SageMaker

Published: 2024-04-08 Issue: 14 Volume: 36
ISSN: 1532-0626
Container-title: Concurrency and Computation: Practice and Experience
Language: en
Short-container-title: Concurrency and Computation

Author:

Keller Tesser Rafael¹²³, Marques Alvaro², Borin Edson²

Affiliation:

1. Center for Computing in Engineering & Sciences, University of Campinas (Unicamp), Sao Paulo, Brazil

2. Institute of Computing, University of Campinas (Unicamp), Sao Paulo, Brazil

3. Bachelor's Course in Computer Science, Federal University of Technology of Parana (UTFPR), Santa Helena, Parana, Brazil

Abstract

Cloud computing has become popular for training deep‐learning (DL) models, avoiding the costs of acquiring and maintaining on‐premise systems. SageMaker is a cloud service that automates the execution of DL workloads. Its features include automatic hyperparameter optimization and the use of spot instances. Nonetheless, it does not assist in selecting the right instance type for a workload. In public clouds, the rental price depends on the configuration of the chosen instance type. More advanced and faster instances are typically more expensive, but they are not always the best choice. To select the optimal instance type, users must compare the workload's relative performance (and hence cost) on several candidates. Building on the execution profiles of multiple DL applications, we model the performance and cost of training DL applications on SageMaker and propose a lightweight technique to estimate both at low temporal and monetary cost. This method is a performance proxy that can replace more expensive performance measurement procedures, and it could therefore speed up any technique that relies on such measurements. We show how it can help cloud customers seeking suitable instance types to train DL models, and that it can accurately predict the performance of different instance types when training these models on SageMaker.
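The comparison described in the abstract comes down to simple arithmetic: the estimated cost of a training run is its estimated duration multiplied by the hourly price of the instance. The following Python sketch illustrates only that arithmetic and is not the authors' proxy technique. The instance labels, prices, throughput figures, and function names are all hypothetical values chosen for illustration; they are not taken from the paper or from SageMaker pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical instance label, not a real SageMaker type
    price_per_hour: float      # hypothetical rental price in USD per hour
    samples_per_second: float  # hypothetical throughput from a short measurement


def estimated_cost(c: Candidate, total_samples: int) -> float:
    """Estimated cost of a full run: duration in hours times hourly price."""
    hours = total_samples / c.samples_per_second / 3600.0
    return hours * c.price_per_hour


def cheapest(candidates: list[Candidate], total_samples: int) -> Candidate:
    """Return the candidate with the lowest estimated cost."""
    return min(candidates, key=lambda c: estimated_cost(c, total_samples))


# Illustrative values only: the larger instance is 3x faster but 4x pricier.
candidates = [
    Candidate("small-gpu", price_per_hour=1.0, samples_per_second=500.0),
    Candidate("large-gpu", price_per_hour=4.0, samples_per_second=1500.0),
]
total_samples = 3_600_000

for c in candidates:
    print(f"{c.name}: ${estimated_cost(c, total_samples):.2f}")
print("cheapest:", cheapest(candidates, total_samples).name)
```

With these made-up numbers the faster instance finishes sooner but costs more overall, which matches the abstract's point that the most advanced instance is not always the best choice.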

Funder

Petrobras

Fundação de Amparo à Pesquisa do Estado de São Paulo

Conselho Nacional de Desenvolvimento Científico e Tecnológico

Publisher

Wiley

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.8104
