Affiliations:
1. Institute of Mathematics, EPFL, Lausanne, Switzerland
2. MIT
Abstract
The goal of this paper is to characterize function distributions that general neural networks trained by descent algorithms (GD/SGD) can or cannot learn in polytime. The results are: (1) The paradigm of general neural networks trained by SGD is poly‐time universal: any function distribution that can be learned from samples in polytime can also be learned by a poly‐size neural net trained by SGD with polynomial parameters. In particular, this can be achieved despite polynomial noise on the gradients, implying a separation result between SGD‐based deep learning and statistical query algorithms, as the latter are not comparably universal due to cases like parities. This also shows that deep learning does not suffer from the limitations of shallow networks. (2) The paper further gives a lower bound on the generalization error of descent algorithms, which relies on two quantities: the cross‐predictability, an average‐case quantity related to the statistical dimension, and the null‐flow, a quantity specific to descent algorithms. The lower bound implies in particular that for functions of low enough cross‐predictability, the above robust universality breaks down once the gradients are averaged over too many samples (as in perfect GD) rather than fewer (as in SGD). (3) Finally, it is shown that if larger amounts of noise are added to the initialization and to the gradients, then SGD is no longer comparably universal, again due to distributions having low enough cross‐predictability.
Subject
Applied Mathematics, General Mathematics