1. A Stochastic Approximation Method;Robbins;The Annals of Mathematical Statistics,1951
2. On the importance of initialization and momentum in deep learning;Sutskever;ICML,2013
3. Adaptive subgradient methods for online learning and stochastic optimization;Duchi;JMLR,2011
4. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude;Tieleman;Coursera: Neural networks for machine learning,2012
5. Adam: A method for stochastic optimization;Kingma;ICLR,2015