Scaling up stochastic gradient descent for non-convex optimisation
Published: 2022-10-07
Volume: 111
Issue: 11
Pages: 4039-4079
ISSN: 0885-6125
Container-title: Machine Learning
Language: en
Short-container-title: Mach Learn
Authors: Saad Mohamad, Hamad Alamri, Abdelhamid Bouchachia
Abstract
Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD for applications involving non-convex functions and large datasets. We address the bottlenecks that arise in both shared-memory and distributed-memory settings: the former is bounded by limited computational resources and bandwidth, whereas the latter suffers from communication overhead. We propose a unified distributed and parallel implementation of SGD (named DPSGD) that relies on both asynchronous distribution and lock-free parallelism. By combining the two strategies into a unified framework, DPSGD strikes a better trade-off between local computation and communication. The convergence properties of DPSGD are studied for non-convex problems such as those arising in statistical modelling and machine learning. Our theoretical analysis shows that DPSGD achieves speed-up with respect to the number of cores and the number of workers while guaranteeing an asymptotic convergence rate of $O(1/\sqrt{T})$, provided that the number of cores is bounded by $T^{1/4}$ and the number of workers is bounded by $T^{1/2}$, where $T$ is the number of iterations. The potential gains achievable by DPSGD are demonstrated empirically on a stochastic variational inference problem (latent Dirichlet allocation) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C), resulting in two algorithms: DPSVI and HSA2C. Empirical results validate our theoretical findings. Comparative studies show the performance of the proposed DPSGD against state-of-the-art DRL algorithms.
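As a rough illustration of the two-level scheme the abstract describes, the following Python sketch combines Hogwild-style lock-free threads inside each worker with asynchronous parameter exchange across workers, on a toy least-squares objective. Python threads stand in for both cores and machines; the function names (lockfree_sgd, dpsgd), the halving merge rule, the objective, and all hyperparameters are illustrative assumptions, not the paper's actual algorithm or implementation.

    # Hypothetical sketch of the DPSGD pattern (not the paper's code):
    # lock-free parallel SGD within a worker, asynchronous merging across workers.
    import threading

    import numpy as np


    def lockfree_sgd(w, X, y, lr, steps, n_threads):
        # Hogwild-style inner loop: threads update the shared vector w in place,
        # without locks, each sampling its own data points.
        def run(seed):
            rng = np.random.default_rng(seed)
            for _ in range(steps):
                i = rng.integers(len(y))
                grad = (X[i] @ w - y[i]) * X[i]  # per-sample least-squares gradient
                w[:] -= lr * grad                # unsynchronized in-place update
        threads = [threading.Thread(target=run, args=(s,)) for s in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


    def dpsgd(X, y, n_workers=4, n_threads=2, rounds=10, lr=0.01, steps=200, seed=0):
        # Outer loop: each "worker" (a machine in the real setting) repeatedly
        # pulls the shared parameters, runs lock-free SGD on its own data shard,
        # and merges its result back without waiting for the other workers.
        global_w = np.zeros(X.shape[1])
        lock = threading.Lock()  # guards only the cheap pull/merge, not computation
        def worker(wid):
            nonlocal global_w
            rng = np.random.default_rng(seed + wid)
            for _ in range(rounds):
                with lock:
                    w = global_w.copy()                  # asynchronous pull
                idx = rng.choice(len(y), size=len(y) // n_workers, replace=False)
                lockfree_sgd(w, X[idx], y[idx], lr, steps, n_threads)
                with lock:
                    global_w = 0.5 * (global_w + w)      # asynchronous merge
        workers = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
        for t in workers:
            t.start()
        for t in workers:
            t.join()
        return global_w


    if __name__ == "__main__":
        rng = np.random.default_rng(42)
        X = rng.normal(size=(2000, 10))
        w_true = rng.normal(size=10)
        y = X @ w_true + 0.01 * rng.normal(size=2000)
        w_est = dpsgd(X, y)
        print("parameter error:", np.linalg.norm(w_est - w_true))

Because CPython threads share one interpreter, this sketch only simulates the concurrency pattern; the speed-ups analysed in the paper require genuine cores within a worker and separate machines across workers.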
Funder
Horizon 2020 Framework Programme
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Software
Cited by
1 article.