Affiliation:
1. Software Engineering Institute, East China Normal University, Shanghai 200062, China
Abstract
In the era of big data, developers need to build deep learning and computer vision systems that can handle both model training and large-scale data processing efficiently. Existing training datasets are heavily reused and contain little scene information, so they cannot meet the needs of large-scale machine training; training therefore requires large-scale data and a distributed computing system. A major challenge for distributed deep learning systems is how to meet the training accuracy requirement of a deep learning model while minimizing resource cost within a constrained time. Resource allocation and batch size hyperparameter configuration are the main ways to optimize a model's training accuracy and resource cost. Existing works configure resources and the batch size hyperparameter independently, the former for computational efficiency and the latter for training accuracy. However, the effects of the two configurations on training accuracy and resource cost have complex dependencies, so independent configuration methods struggle to satisfy the accuracy requirement and minimize resource cost at the same time. To address these problems, this paper proposes a collaborative resource and batch size configuration optimization method for distributed deep learning systems.
The method first exploits the monotonic relationships that link the resource and batch size configurations to model training time and training accuracy: using isotonic (order-preserving) regression, we build separate prediction models for the time of a single complete training round and for the final training accuracy of computer vision target classification and recognition models. We then use these two models together to solve for the resource and batch size configuration that meets the model's training accuracy requirement while minimizing resource cost. Finally, we evaluate the performance of the proposed method for computer vision target recognition on the proposed distributed deep learning system.
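The two-step idea described above (monotone prediction models, then a constrained search for the cheapest configuration) can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. Everything in it is an assumption made for illustration: the function names, the synthetic measurements, the monotonicity directions (training time assumed non-increasing in the number of workers, final accuracy assumed non-increasing in batch size), the worker-hours cost model, and the exhaustive grid search.

```python
def pava(ys):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block holds [sum of values, number of values]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]


def non_increasing_fit(ys):
    """Isotonic fit for a quantity assumed to fall as the input grows."""
    return [-v for v in pava([-y for y in ys])]


def choose_config(workers, batches, raw_time, raw_acc, target_acc, deadline):
    """Return (cost, workers, batch) with minimal cost, or None if infeasible."""
    # Accuracy model: one monotone fit over batch sizes (illustrative assumption).
    acc = dict(zip(batches, non_increasing_fit(raw_acc)))
    best = None
    for b in batches:
        if acc[b] < target_acc:
            continue  # predicted accuracy misses the requirement
        # Time model: one monotone fit over worker counts per batch size.
        times = non_increasing_fit(raw_time[b])
        for w, t in zip(workers, times):
            if t > deadline:
                continue  # predicted training time exceeds the time limit
            cost = w * t  # hypothetical cost model: worker-hours
            if best is None or cost < best[0]:
                best = (cost, w, b)
    return best


# Synthetic, noisy measurements invented for this sketch (not from the paper).
WORKERS = [1, 2, 4, 8]
BATCHES = [64, 128, 256, 512]
RAW_TIME = {  # hours per complete training run, indexed like WORKERS
    64: [10.0, 6.0, 6.5, 4.0],
    128: [8.0, 4.5, 3.0, 3.2],
    256: [6.0, 3.5, 2.0, 1.5],
    512: [5.0, 3.0, 1.6, 1.0],
}
RAW_ACC = [0.92, 0.93, 0.90, 0.86]  # final accuracy, indexed like BATCHES

best = choose_config(WORKERS, BATCHES, RAW_TIME, RAW_ACC,
                     target_acc=0.90, deadline=4.0)
print(best)
```

With these synthetic numbers, the search selects 2 workers with a batch size of 256 at a cost of 7.0 worker-hours. The isotonic fits smooth out measurements that violate the assumed ordering (for example, the 4-worker run at batch size 64 that was slower than the 2-worker run) before the search compares configurations.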
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Information Systems