1. Qian N (1999) On the momentum term in gradient descent learning algorithms. Neural Netw 12(1):145–151
2. Nesterov Y (1983) A method for unconstrained convex minimization problem with the rate of convergence $$O(1/k^2)$$. Dokl USSR 269:543–547
3. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(7):2121–2159
4. Hinton G, Srivastava N, Swersky K (2012) Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent. Cited on 14(8):2
5. Zeiler MD (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701