After computing the bias-corrected state variables, Adam calculates its parameter update. First, it rescales the gradient to obtain $$\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}$$. The rescaling resembles RMSProp, but it uses the debiased momentum $$\hat{\mathbf{v}}_t$$ rather than the raw gradient, and $$\epsilon$$ (typically $$10^{-6}$$, chosen for numerical stability) is added outside the square root rather than inside it. The explicit learning rate $$\eta$$, which controls the step length, is already folded into $$\mathbf{g}_t'$$, so the model parameters are then updated via the simple rule $$\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'$$.
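As a rough illustration, the full step can be sketched in NumPy as follows. The function name `adam_step`, its argument order, and the default `eta=0.01` are assumptions made for this example, not part of any library API.

```python
import numpy as np

def adam_step(x, g, v, s, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam step (hypothetical helper). t is the 1-based step counter."""
    # Leaky averages of the gradient and of its elementwise square.
    v = beta1 * v + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    # Bias correction for the zero initialization.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Rescaled gradient: eps is added outside the square root.
    g_prime = eta * v_hat / (np.sqrt(s_hat) + eps)
    # The learning rate is already inside g_prime, so the update is a plain subtraction.
    return x - g_prime, v, s

x = np.array([1.0, -2.0])
g = np.array([0.5, -3.0])  # made-up gradient for illustration
x_new, v, s = adam_step(x, g, np.zeros_like(x), np.zeros_like(x), t=1)
```

On the very first step the debiased estimates reduce to $$\hat{\mathbf{v}}_1 = \mathbf{g}_1$$ and $$\hat{\mathbf{s}}_1 = \mathbf{g}_1^2$$, so each coordinate moves by roughly $$\eta$$ in the direction opposite to its gradient sign, regardless of the gradient's magnitude.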

Adam Optimizer Update Rule

In the Adam optimizer, the state variables for momentum ($$\mathbf{v}_t$$) and the second moment ($$\mathbf{s}_t$$) are typically initialized to zero ($$\mathbf{v}_0 = \mathbf{s}_0 = 0$$). This initialization introduces a significant bias towards smaller values during the initial training steps. To correct this bias, Adam re-normalizes the terms using the identity $$\sum_{i=0}^{t-1} \beta^i = \frac{1 - \beta^t}{1 - \beta}$$: after $$t$$ steps, the weights $$(1 - \beta)\beta^i$$ that the leaky average places on past gradients sum to only $$1 - \beta^t$$ rather than 1. Dividing by this sum gives the debiased, or normalized, state variables $$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t}$$ and $$\hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}$$.
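The effect of the correction is easiest to see with a constant gradient, where the debiased estimates should recover the gradient and its square exactly. The following is a minimal sketch under that assumption; `debias` is a hypothetical helper name and the gradient values are made up.

```python
import numpy as np

def debias(v, s, t, beta1=0.9, beta2=0.999):
    """Divide each state variable by the sum of its leaky-average weights."""
    return v / (1 - beta1 ** t), s / (1 - beta2 ** t)

beta1, beta2 = 0.9, 0.999
g = np.array([0.5, -3.0])  # constant, made-up gradient
v = np.zeros_like(g)
s = np.zeros_like(g)

for t in range(1, 6):
    v = beta1 * v + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    v_hat, s_hat = debias(v, s, t, beta1, beta2)

# After 5 steps the raw v is still only (1 - 0.9**5) of g in magnitude,
# while v_hat and s_hat match g and g**2 up to floating-point error.
```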

Claude

A key component of the Adam optimization algorithm is its use of exponentially weighted moving averages, also known as leaky averaging, to estimate both the momentum and the second moment of the gradient. At each time step $$t$$, it maintains two state variables: $$\mathbf{v}_t \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t$$ and $$\mathbf{s}_t \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$$, where $$\mathbf{g}_t^2$$ denotes the elementwise square of the gradient. The terms $$\beta_1$$ and $$\beta_2$$ are nonnegative weighting parameters. Common default choices are $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$, so the variance estimate $$\mathbf{s}_t$$ adapts much more slowly than the momentum term $$\mathbf{v}_t$$.
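The two recurrences translate almost line for line into NumPy. This is a minimal sketch; the function name `update_moments` and the sample gradient are assumptions for illustration.

```python
import numpy as np

def update_moments(v, s, g, beta1=0.9, beta2=0.999):
    """One leaky-average update of Adam's two state variables."""
    v = beta1 * v + (1 - beta1) * g        # momentum estimate
    s = beta2 * s + (1 - beta2) * g ** 2   # second moment (elementwise square)
    return v, s

g = np.array([1.0, 2.0])  # made-up gradient
v, s = update_moments(np.zeros(2), np.zeros(2), g)
```

Starting from zero, a single update absorbs 10% of the gradient into $$\mathbf{v}_t$$ but only 0.1% of its square into $$\mathbf{s}_t$$, which is one way to see why the second-moment estimate moves so much more slowly.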

Adam State Variables

Dive into Deep Learning

Adam Bias Correction

A known limitation of the Adam optimization algorithm is that it can fail to converge, even in convex optimization settings. This failure typically occurs when the second moment estimate $$\mathbf{s}_t$$ blows up. Specifically, when the squared gradient $$\mathbf{g}_t^2$$ has high variance or when parameter updates are sparse, $$\mathbf{s}_t$$ may forget its past values too quickly, which destabilizes learning. These convergence issues can be mitigated either by increasing the minibatch size during training or by switching to an optimization algorithm that provides an improved estimate for $$\mathbf{s}_t$$, such as the Yogi optimizer.
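To make the "forgetting" concrete, the sketch below compares Adam's second-moment update with the Yogi update under sparse (all-zero) gradients. The passage does not spell out the Yogi rule; the form used here, $$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2)\, \mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1})$$, is the one attributed to Zaheer et al. (2018) and should be read as an assumption, as should both function names.

```python
import numpy as np

def adam_second_moment(s, g, beta2=0.999):
    """Adam: the change in s scales with the gap between g**2 and s."""
    return beta2 * s + (1 - beta2) * g ** 2

def yogi_second_moment(s, g, beta2=0.999):
    """Yogi (assumed form): the change in s scales with g**2 only."""
    g2 = g ** 2
    return s + (1 - beta2) * g2 * np.sign(g2 - s)

s_adam = np.array([1.0])
s_yogi = np.array([1.0])
zero_grad = np.array([0.0])  # a coordinate that receives no gradient signal

for _ in range(1000):
    s_adam = adam_second_moment(s_adam, zero_grad)
    s_yogi = yogi_second_moment(s_yogi, zero_grad)

# s_adam has decayed by a factor of 0.999**1000, while s_yogi is unchanged.
```

In this toy setting Adam's estimate shrinks on every step without gradient signal, whereas the Yogi-style update leaves it untouched because the size of its change depends on $$\mathbf{g}_t^2$$ rather than on how far $$\mathbf{s}_{t-1}$$ is from it.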
