Affiliation:
1. School of Automation, Beijing Institute of Technology, Beijing, China
2. Infrastructure Inspection Research Institute, China Academy of Railway Sciences, Beijing, China
Abstract
Current mainstream deep learning optimization algorithms can be classified into two categories: non-adaptive optimization algorithms, such as Stochastic Gradient Descent with Momentum (SGDM), and adaptive optimization algorithms, like Adaptive Moment Estimation with Weight Decay (AdamW). Adaptive optimization algorithms for many deep neural network models typically enable faster initial training, whereas non-adaptive optimization algorithms often yield better final convergence. Our proposed Adaptive Learning Rate Burst (Adaburst) algorithm seeks to combine the strengths of both categories. The update mechanism of Adaburst incorporates elements from AdamW and SGDM, ensuring a seamless transition between the two. Adaburst modifies the learning rate of the SGDM algorithm based on a cosine learning rate schedule, particularly when the algorithm encounters an update bottleneck, which is called learning rate burst. This approach helps the model to escape current local optima more effectively. The results of the Adaburst experiment underscore its enhanced performance in image classification and generation tasks when compared with alternative approaches, characterized by expedited convergence and elevated accuracy. Notably, on the MNIST, CIFAR-10, and CIFAR-100 datasets, Adaburst attained accuracies that matched or exceeded those achieved by SGDM. Furthermore, in training diffusion models on the DeepFashion dataset, Adaburst achieved convergence in fewer epochs than a meticulously calibrated AdamW optimizer while avoiding abrupt blurring or other training instabilities. Adaburst augmented the final training set accuracy on the MNIST, CIFAR-10, and CIFAR-100 datasets by 0.02%, 0.41%, and 4.18%, respectively. In addition, the generative model trained on the DeepFashion dataset demonstrated a 4.62-point improvement in the Frechet Inception Distance (FID) score, a metric for assessing generative model quality. Consequently, this evidence suggests that Adaburst introduces an innovative optimization algorithm that simultaneously updates AdamW and SGDM and incorporates a learning rate burst mechanism. This mechanism significantly enhances deep neural networks’ training speed and convergence accuracy.