1. A method of solving a convex programming problem with convergence rate O(1/k^2);Nesterov;Sov. Math. Dokl.,1983
2. Optimization methods for large-scale machine learning;Bottou;SIAM Rev.,2018
3. A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights;Su;J. Mach. Learn. Res.,2016
4. Understanding the acceleration phenomenon via high-resolution differential equations;Shi;Math. Prog.,2021
5. On the importance of initialization and momentum in deep learning;Sutskever;Proc. Mach. Learn. Res.,2013