Optimizing Neural Networks with Adam: A Practical Guide to Learning Rates and Decay Schedulers

Vitality Learning
5 min read · Sep 18, 2024
Photo by Igor Omilaev on Unsplash

Training deep learning models requires selecting appropriate optimization algorithms and hyperparameters to ensure fast and stable convergence. One of the most widely used optimizers for neural networks is Adam (Adaptive Moment Estimation), which adapts the learning rate of each parameter dynamically during training. In this post, we break down the mathematics behind Adam, explore the role of the learning rate, and show how to use learning rate schedulers to further improve model performance.
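
As a preview of where we are heading, here is a minimal PyTorch sketch pairing Adam with a step-decay learning rate scheduler. The toy model and data, and the lr, step_size, and gamma values below, are illustrative placeholders rather than settings taken from this post.

```python
import torch
import torch.nn as nn

# Toy model and data, used only to make the example self-contained.
model = nn.Linear(10, 1)
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)
loss_fn = nn.MSELoss()

# Adam with a common initial learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step decay: halve the learning rate every 10 epochs (illustrative values).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.zero_grad()                   # clear accumulated gradients
    loss = loss_fn(model(inputs), targets)  # forward pass and loss
    loss.backward()                         # backpropagate gradients
    optimizer.step()                        # Adam parameter update
    scheduler.step()                        # decay the learning rate once per epoch
```

StepLR is only one of several schedulers; the same pattern works with, for example, ExponentialLR or CosineAnnealingLR.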

Mathematics Behind the Adam Optimizer

The Adam optimizer combines features from Momentum and RMSProp. It maintains an exponentially decaying average of past gradients (the first moment) and of past squared gradients (the second moment). The parameter update in Adam follows these steps:

Gradient Calculation:

g_t = ∇_θ L(θ_t),

which is the gradient of the loss L at time step t, where θ_t is the array of parameters to be optimized.

First Moment Estimate (Momentum):

m_t = β_1 m_{t−1} + (1 − β_1) g_t,

where β_1 is the decay rate for the first moment, typically set to 0.9.

Second Moment Estimate (RMSProp):

v_t = β_2 v_{t−1} + (1 − β_2) g_t²,

where β_2 is the decay rate for the second moment, typically set to 0.999.
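
To make these steps concrete, the following is a minimal NumPy sketch of a single Adam update. For completeness it also includes the bias-corrected moment estimates and the final parameter step that conclude the algorithm; the function name adam_step and the defaults lr=1e-3 and ε=1e-8 are the usual textbook choices, used here only for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponentially decaying average of past gradients
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: exponentially decaying average of past squared gradients
    v = beta2 * v + (1.0 - beta2) * grad**2
    # Bias correction compensates for m and v being initialized at zero
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    # Parameter update with an adaptive, per-parameter step size
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize L(θ) = ||θ||² starting from θ = [1, -2]
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2.0 * theta                       # ∇L(θ) for L(θ) = ||θ||²
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)                                 # θ is driven toward the minimizer [0, 0]
```

Frameworks such as PyTorch and TensorFlow implement this same update internally, so in practice you only choose the learning rate and the decay rates.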