Abstract
We provide the first theoretical analysis of the convergence rate of asynchronous mini-batch gradient descent with variance reduction (AsySVRG) for non-convex optimization. Asynchronous stochastic gradient descent (AsySGD) has been broadly used for deep learning optimization, and it is proved to converge at a rate of O(1/\sqrt{T}) for non-convex optimization. Recently, the variance reduction technique was proposed and proved to greatly accelerate the convergence of SGD. It has been shown that asynchronous SGD with variance reduction achieves a linear convergence rate when the problem is strongly convex. However, no existing work analyzes the convergence rate of this method for non-convex problems. In this paper, we consider two asynchronous parallel implementations of mini-batch gradient descent with variance reduction: one on a distributed-memory architecture and the other on a shared-memory architecture. We prove that both methods converge at a rate of O(1/T) for non-convex optimization, and that linear speedup is achievable as the number of workers increases. We evaluate our methods by optimizing multi-layer neural networks on two real datasets (MNIST and CIFAR-10), and the experimental results support our theoretical analysis.
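The variance reduction referred to in the abstract is, in its serial form, the standard SVRG construction: a full gradient is computed at a periodic snapshot point, and each mini-batch step corrects the stochastic gradient with that snapshot. The Python sketch below illustrates this serial mini-batch SVRG loop under the usual finite-sum assumptions; the function names and parameters are illustrative and do not reproduce the paper's asynchronous AsySVRG implementation.

```python
import numpy as np

def svrg(grad_i, x0, n, T, m, batch_size, lr, rng=None):
    """Serial mini-batch SVRG sketch (illustrative only).

    grad_i(x, i) -- gradient of the i-th component function at x
    n            -- number of component functions f_1, ..., f_n
    T, m         -- number of outer epochs and inner iterations per epoch
    """
    rng = rng or np.random.default_rng(0)
    x_snapshot = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        # Full gradient at the snapshot point, computed once per epoch.
        full_grad = np.mean([grad_i(x_snapshot, i) for i in range(n)], axis=0)
        x = x_snapshot.copy()
        for _ in range(m):
            batch = rng.integers(0, n, size=batch_size)
            # Variance-reduced estimator: unbiased, with variance that
            # shrinks as x approaches the snapshot point.
            g = np.mean([grad_i(x, i) - grad_i(x_snapshot, i) for i in batch],
                        axis=0) + full_grad
            x = x - lr * g
        x_snapshot = x
    return x_snapshot
```

In the asynchronous variants analyzed in the paper, multiple workers evaluate such variance-reduced mini-batch gradients in parallel and apply them to a shared or distributed parameter without locking, which is what the O(1/T) rate and linear-speedup results cover.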
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
6 articles.