1. Ahn K, Zhang J, Sra S (2022) Understanding the unstable convergence of gradient descent. In: Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S (eds) International conference on machine learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA. Proceedings of machine learning research, vol 162. PMLR, pp 247–257. https://proceedings.mlr.press/v162/ahn22a.html
2. Antonakopoulos K, Mertikopoulos P, Piliouras G, Wang X (2022) AdaGrad avoids saddle points. In: Chaudhuri K, Jegelka S, Song L, Szepesvári C, Niu G, Sabato S (eds) International conference on machine learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA. Proceedings of machine learning research, vol 162. PMLR, pp 731–771. https://proceedings.mlr.press/v162/antonakopoulos22a.html
3. Cazenave T, Sentuc J, Videau M (2022) Cosine annealing, Mixnet and Swish activation for computer Go. In: Browne C, Kishimoto A, Schaeffer J (eds) Advances in computer games. Springer International Publishing, Cham, pp 53–60
4. Duchi JC, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. https://doi.org/10.5555/1953048.2021068
5. Ghojogh B, Ghojogh A, Crowley M, Karray F (2019) Fitting a mixture distribution to data: tutorial. arXiv:1901.06708