Gradient Descent Provably Escapes Saddle Points in the Training of Shallow ReLU Networks

Published:2024-09-10
ISSN:0022-3239
Container-title:Journal of Optimization Theory and Applications
language:en
Short-container-title:J Optim Theory Appl

Author:

Cheridito Patrick, Jentzen Arnulf, Rossmannek Florian

Abstract

Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms bypass so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In this paper, we prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. We explore its relevance for various machine learning tasks, with a particular focus on shallow rectified linear unit (ReLU) and leaky ReLU networks with scalar input. Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks relative to an affine target function, we show that gradient descent circumvents most saddle points. Furthermore, we prove convergence to global minima under favourable initialization conditions, quantified by an explicit threshold on the limiting loss.
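
To make the notion of a strict saddle point concrete, here is a minimal sketch in Python. It does not reproduce the paper's setting: the toy function f(x, y) = x²/2 + y⁴/4 − y²/2, the step size, the iteration count, and the helper names `grad` and `gradient_descent` are all assumptions chosen for illustration. The origin is a strict saddle of this function (its Hessian there has eigenvalues 1 and −1), and the line y = 0 plays the role of the stable manifold: gradient descent started exactly on it converges to the saddle, while a generic initialization moves away towards one of the minima at (0, ±1).

```python
import random


def grad(x, y):
    # Gradient of the toy function f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2
    # (an illustrative assumption, not the loss studied in the paper).
    return x, y**3 - y


def gradient_descent(x, y, step=0.1, iters=500):
    # Plain gradient descent with a constant step size.
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


# Initialization exactly on the line y = 0: the iterates stay on that line
# and converge to the saddle point at the origin.
on_manifold = gradient_descent(1.0, 0.0)

# Generic initialization (fixed seed for reproducibility): the y-coordinate
# is nonzero, so the iterates leave the saddle and approach a minimum.
rng = random.Random(0)
generic = gradient_descent(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))

print("started on y = 0:", on_manifold)
print("generic start:   ", generic)
```

The set of initializations that end up at the saddle is the single line y = 0, which has measure zero in the plane. This is the kind of statement a center-stable manifold theorem yields for smooth functions; the abstract's point is that the ReLU losses lack the usual regularity, so a relaxed version of the theorem is needed.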

Funder

Deutsche Forschungsgemeinschaft

HORIZON EUROPE European Research Council

Schmidt Futures

Universität Münster

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10957-024-02513-3.pdf
