Deep Neural Networks Training by Stochastic Quasi-Newton Trust-Region Methods

Published:2023-10-20 Issue:10 Volume:16 Page:490
ISSN:1999-4893
Container-title:Algorithms
language:en
Short-container-title:Algorithms

Author:

Yousefi Mahsa¹, Martínez Ángeles¹

Affiliation:

1. Department of Mathematics and Geosciences, University of Trieste, 34127 Trieste, Italy

Abstract

While first-order methods are popular for solving optimization problems arising in deep learning, they come with some acute deficiencies. To overcome these shortcomings, there has been recent interest in introducing second-order information through quasi-Newton methods that are able to construct Hessian approximations using only gradient information. In this work, we study the performance of stochastic quasi-Newton algorithms for training deep neural networks. We consider two well-known quasi-Newton updates, the limited-memory Broyden–Fletcher–Goldfarb–Shanno (BFGS) and the symmetric rank one (SR1). This study fills a gap concerning the real performance of both updates in the minibatch setting and analyzes whether more efficient training can be obtained when using the more robust BFGS update or the cheaper SR1 formula, which—allowing for indefinite Hessian approximations—can potentially help to better navigate the pathological saddle points present in the non-convex loss functions found in deep learning. We present and discuss the results of an extensive experimental study that includes many aspects affecting performance, like batch normalization, the network architecture, the limited memory parameter or the batch size. Our results show that stochastic quasi-Newton algorithms are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer, run with the optimal combination of its numerous hyperparameters, and the stochastic second-order trust-region STORM algorithm.
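To make the contrast between the two updates concrete, here is a minimal sketch of the textbook dense BFGS and SR1 formulas in plain Python. It is not the paper's limited-memory, trust-region implementation: the helper names (`dot`, `matvec`, `bfgs_update`, `sr1_update`), the skip thresholds, and the toy data are all assumptions made for illustration. Both updates use only a step `s` and a gradient difference `y`, and aim to satisfy the secant equation `B_new s = y`.

```python
# Illustrative dense quasi-Newton updates (hypothetical helpers, not the paper's code).
# B: current Hessian approximation, s: parameter step, y: gradient difference.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(B, v):
    return [dot(row, v) for row in B]

def bfgs_update(B, s, y, tol=1e-8):
    """BFGS: B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s)."""
    Bs = matvec(B, s)
    sBs, ys = dot(s, Bs), dot(y, s)
    # Skip when the curvature condition y^T s > 0 fails; this keeps B positive definite.
    if ys <= tol * dot(s, s):
        return B
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]

def sr1_update(B, s, y, tol=1e-8):
    """SR1: B + (y - B s)(y - B s)^T / ((y - B s)^T s)."""
    Bs = matvec(B, s)
    v = [yi - bi for yi, bi in zip(y, Bs)]
    denom = dot(v, s)
    # Skip only when the denominator is too small to divide by safely.
    if abs(denom) <= tol * (dot(s, s) ** 0.5) * (dot(v, v) ** 0.5):
        return B
    n = len(s)
    return [[B[i][j] + v[i] * v[j] / denom for j in range(n)] for i in range(n)]

# Toy pair with negative curvature along s (y^T s < 0), as can occur near a saddle point.
B0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.0]
y = [-2.0, 0.0]

B_sr1 = sr1_update(B0, s, y)    # absorbs the negative curvature; B becomes indefinite
B_bfgs = bfgs_update(B0, s, y)  # curvature condition fails; update is skipped
```

On this toy pair the SR1 approximation picks up a negative diagonal entry while still satisfying the secant equation, whereas the BFGS sketch discards the pair. Skipping is only one common safeguard and is assumed here for simplicity.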

Publisher

MDPI AG

Subject

Computational Mathematics, Computational Theory and Mathematics, Numerical Analysis, Theoretical Computer Science

Link

https://www.mdpi.com/1999-4893/16/10/490/pdf

References: 49 articles.

1. A stochastic approximation method;Robbins;Ann. Math. Stat.,1951

2. Large-scale online learning;Bottou;Adv. Neural Inf. Process. Syst.,2004

3. Defazio, A., Bach, F., and Lacoste-Julien, S. (2014, January 8–13). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada.

4. Accelerating stochastic gradient descent using predictive variance reduction;Johnson;Adv. Neural Inf. Process. Syst.,2013

5. Minimizing finite sums with the stochastic average gradient;Schmidt;Math. Program.,2017

Cited by 1 article.

1. A non-monotone trust-region method with noisy oracles and additional sampling;Computational Optimization and Applications;2024-05-31