1. Robbins H, Monro S. A stochastic approximation method. Ann Math Statist, 1951, 22: 400–407
2. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learning Res, 2011, 12: 2121–2159
3. Sutskever I, Martens J, Dahl G, et al. On the importance of initialization and momentum in deep learning. In: Proceedings of International Conference on Machine Learning, 2013. 1139–1147
4. Ben-Nun T, Hoefler T. Demystifying parallel and distributed deep learning. ACM Comput Surv, 2020, 52: 1–43
5. Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks. In: Proceedings of Conference and Workshop on Neural Information Processing Systems, 2012. 1223–1231