Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks

Authors:

Kim Bum Jun (1), Choi Hyeyeon (1), Jang Hyeonah (1), Kim Sang Woo (1)

Affiliation:

1. Pohang University of Science and Technology, Pohang, Republic of Korea

Abstract

L2 regularization of weights in neural networks is widely used as a standard training trick. Batch normalization introduces an additional trainable parameter γ, which acts as a scaling factor. However, whether L2 regularization should be applied to γ has been little discussed, and it is applied in different ways depending on the library and practitioner. In this article, we study whether L2 regularization for γ is valid. To explore this issue, we consider two approaches: (1) variance control to make the residual network behave like an identity mapping and (2) stable optimization through improvement of the effective learning rate. Through these two analyses, we identify the γ for which L2 regularization is desirable and those for which it is undesirable, and propose four guidelines for managing them. In several experiments, we observed that applying L2 regularization to applicable γ increased classification accuracy by 1% to 4%, whereas applying it to inapplicable γ decreased classification accuracy by 1% to 3%, which is consistent with our four guidelines. Our proposed guidelines were further validated on various tasks and architectures, including variants of residual networks and transformers.
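To make the terms in the abstract concrete, the following is a minimal sketch in plain Python of batch normalization with a scaling factor γ and of an L2 penalty that can include or exclude individual γ. It is not the authors' implementation. The function names, the `regularize_gamma` switch, and the example values are hypothetical, and the choice of which γ to regularize stands in for the paper's four guidelines, which the abstract does not spell out.

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

def l2_penalty(weights, gammas, lam, regularize_gamma):
    """Return (lam / 2) * sum of squares over the selected parameters.

    weights: ordinary layer weights, always penalized in this sketch.
    gammas: mapping from a layer name to its batch-normalization gamma.
    regularize_gamma: hypothetical per-layer switch (name -> bool) that
        decides whether that layer's gamma enters the penalty.
    """
    total = sum(w * w for w in weights)
    total += sum(g * g for name, g in gammas.items() if regularize_gamma[name])
    return 0.5 * lam * total

# Illustrative values only; they are not taken from the paper.
weights = [0.5, -1.0, 2.0]
gammas = {"bn1": 1.0, "bn2": 3.0}

# Penalty on weights only, then with the gamma of "bn2" included as well.
weights_only = l2_penalty(weights, gammas, 0.1, {"bn1": False, "bn2": False})
with_bn2 = l2_penalty(weights, gammas, 0.1, {"bn1": False, "bn2": True})

# gamma sets the scale of the normalized output, which is why shrinking it
# changes the variance that a layer contributes.
scaled = batch_norm([1.0, 3.0], gamma=2.0, beta=0.0)
```

Including a γ in the penalty pulls it toward zero during training, which shrinks the output scale of that normalization layer. Excluding it leaves the scale to be set by the task loss alone.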

Funder

Samsung Electronics Co., Ltd

Publisher

Association for Computing Machinery (ACM)

