Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks

Author:

Bartlett Peter L. (1), Helmbold David P. (2), Long Philip M. (3)

Affiliation:

1. Department of Statistics, University of California, Berkeley, Berkeley, CA 94720-3860, U.S.A.

2. Computer Science Department, University of California Santa Cruz, Santa Cruz, CA 95064, U.S.A.

3. Google, Mountain View, CA 94043, U.S.A.

Abstract

We analyze algorithms for approximating a function [Formula: see text] mapping [Formula: see text] to [Formula: see text] using deep linear neural networks, that is, that learn a function [Formula: see text] parameterized by matrices [Formula: see text] and defined by [Formula: see text]. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix [Formula: see text], in the case where the initial hypothesis [Formula: see text] has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for [Formula: see text] whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If [Formula: see text] is symmetric positive definite, we show that an algorithm that initializes [Formula: see text] learns an [Formula: see text]-approximation of [Formula: see text] using a number of updates polynomial in [Formula: see text], the condition number of [Formula: see text], and [Formula: see text]. In contrast, we show that if the least-squares matrix [Formula: see text] is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that [Formula: see text] satisfies [Formula: see text] for all [Formula: see text] but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant [Formula: see text] for all [Formula: see text] and the other that “balances” [Formula: see text] so that they have the same singular values.
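The setting described in the abstract can be illustrated with a small numerical sketch. It relies on one standard fact: when the inputs are isotropic, the population quadratic loss of a linear map depends only on the squared Frobenius distance between the product of the layer matrices and the least-squares matrix. Everything else here is an assumption made for illustration and is not taken from the paper: the names `product` and `train_identity_init`, the 1/2 scaling of the loss, and the depth, step size, and iteration count. It is not the authors' exact algorithm or step-size schedule.

```python
import numpy as np


def product(mats, d):
    """Return mats[-1] @ ... @ mats[0] (the d x d identity if mats is empty)."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out


def train_identity_init(target, depth=5, lr=0.01, steps=2000):
    """Plain gradient descent on 0.5 * ||A_L ... A_1 - target||_F^2.

    Every layer starts at the identity. depth, lr and steps are
    illustrative defaults, not values from the paper.
    """
    d = target.shape[0]
    layers = [np.eye(d) for _ in range(depth)]
    losses = []
    for _ in range(steps):
        residual = product(layers, d) - target
        losses.append(0.5 * np.sum(residual ** 2))
        grads = []
        for i in range(depth):
            above = product(layers[i + 1:], d)  # layers applied after layer i
            below = product(layers[:i], d)      # layers applied before layer i
            # Gradient of the loss with respect to layer i.
            grads.append(above.T @ residual @ below.T)
        # Update all layers simultaneously from the same iterate.
        layers = [a - lr * g for a, g in zip(layers, grads)]
    return layers, losses


if __name__ == "__main__":
    _, spd_losses = train_identity_init(np.diag([2.0, 0.5]))
    _, neg_losses = train_identity_init(np.diag([-1.0, 1.0]))
    print("positive definite target, final loss:", spd_losses[-1])
    print("negative eigenvalue target, final loss:", neg_losses[-1])
```

With a symmetric positive-definite target such as `np.diag([2.0, 0.5])`, the loss in this sketch shrinks toward zero. With a symmetric target that has a negative eigenvalue, such as `np.diag([-1.0, 1.0])`, the loss stalls at a value bounded away from zero, which is consistent with the failure result stated in the abstract.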

Publisher

MIT Press - Journals

Subject

Cognitive Neuroscience, Arts and Humanities (miscellaneous)

References: 34 articles.

Cited by 18 articles.

1. Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations;Mathematics of Operations Research;2022-11

2. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks;Applied and Computational Harmonic Analysis;2022-07

3. Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH);CSIAM Transactions on Applied Mathematics;2022-06

4. Asymptotic Convergence Rate of Dropout on Shallow Linear Neural Networks;Proceedings of the ACM on Measurement and Analysis of Computing Systems;2022-05-26

5. Understanding Dynamics of Nonlinear Representation Learning and Its Application;Neural Computation;2022-03-23
