Abstract
Learning in deep neural networks takes place by minimizing a nonconvex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to be able to find good minimizers without getting stuck in local critical points and such minimizers are often satisfactory at avoiding overfitting. How these 2 features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far-reaching open question. In this paper we study basic nonconvex 1- and 2-layer neural network models that learn random patterns and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective we derive entropy-driven greedy and message-passing algorithms that focus their search on wide flat regions of minimizers. In the case of SGD and cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate the results by a numerical study of the correlations between the volumes of the minimizers, their Hessian, and their generalization performance on real data.
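The abstract mentions that SGD on the cross-entropy loss combined with a slow reduction of the weight norm steers the search toward wide flat minima (WFM). Below is a minimal toy sketch, not the authors' code, illustrating that idea on a 1-layer perceptron learning random patterns: cross-entropy SGD with a gradually shrinking norm target, followed by a crude flatness probe that measures the training error under random weight perturbations (a proxy for the local volume around the minimizer). All sizes, hyperparameters, and helper names are illustrative assumptions.

```python
# Toy sketch (assumptions throughout): perceptron on random +/-1 patterns,
# cross-entropy SGD with slow weight-norm reduction, then a flatness probe.
import numpy as np

rng = np.random.default_rng(0)

N, P = 200, 120                        # input dimension, number of random patterns
X = rng.choice([-1.0, 1.0], size=(P, N))
y = rng.choice([0.0, 1.0], size=P)     # random binary labels

w = rng.normal(scale=1.0 / np.sqrt(N), size=N)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, steps = 0.05, 20000
for t in range(steps):
    i = rng.integers(P)                          # single-sample SGD step
    p = sigmoid(X[i] @ w)
    w -= lr * (p - y[i]) * X[i]                  # cross-entropy gradient
    # slow norm reduction: rescale w toward a shrinking target norm
    target = np.sqrt(N) * (1.0 - 0.5 * t / steps)
    w *= min(1.0, target / (np.linalg.norm(w) + 1e-12))

def train_error(weights):
    preds = (X @ weights > 0).astype(float)
    return np.mean(preds != y)

def perturbed_error(weights, radius, trials=200):
    """Mean training error after isotropic perturbations of relative size `radius`."""
    errs = []
    for _ in range(trials):
        d = rng.normal(size=N)
        d *= radius * np.linalg.norm(weights) / np.linalg.norm(d)
        errs.append(train_error(weights + d))
    return np.mean(errs)

print("train error:", train_error(w))
for r in (0.05, 0.1, 0.2):
    print(f"mean error at relative perturbation {r}: {perturbed_error(w, r):.3f}")
```

In this sketch, a minimizer sitting in a wide flat region should keep a low perturbed error even at the larger perturbation radii; a narrow minimum degrades quickly. This is only a heuristic stand-in for the local-entropy and Hessian measurements described in the paper.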
Funder
DOD | United States Navy | Office of Naval Research
Publisher
Proceedings of the National Academy of Sciences
Cited by
45 articles.