Authors:
Wang Yijun, Zhou Pengyu, Zhong Wenya
Abstract
Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad, and RMSprop have been found to generalize more poorly than stochastic gradient descent (SGD). Keskar et al. (2017) therefore proposed a hybrid strategy that starts training with Adam and switches to SGD at an appropriate point. Moreover, in learning tasks with a large output space, Adam has been observed to fail to converge to an optimal solution (or, in non-convex settings, to a stationary point) [1]. This paper therefore proposes AMSGrad, a new variant of the Adam algorithm that not only resolves the convergence problem but also improves empirical performance.
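AMSGrad's change to Adam is small: instead of normalizing by the current second-moment estimate, it normalizes by a running maximum of that estimate, so the effective step size can never grow between iterations. A minimal NumPy sketch of one update step follows; the function name, default hyperparameters, and the toy quadratic are illustrative, not taken from the paper.

```python
import numpy as np

def amsgrad_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update. The only difference from Adam is that v_hat keeps
    a running maximum of the second-moment estimate, which guarantees a
    non-increasing effective learning rate per coordinate."""
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate (as in Adam)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate (as in Adam)
    v_hat = np.maximum(v_hat, v)              # AMSGrad: non-decreasing v_hat
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, (m, v, v_hat)

# Illustrative usage: minimize f(x) = x^2 starting from x = 1.0
theta = np.array(1.0)
state = (np.zeros_like(theta), np.zeros_like(theta), np.zeros_like(theta))
for _ in range(500):
    grad = 2 * theta
    theta, state = amsgrad_step(theta, grad, state)
```

Because `v_hat` is monotone, the counterexamples in which Adam's step size oscillates and prevents convergence no longer apply.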
References (9 articles)
1. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
2. Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
3. Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Cited by
12 articles.