A known limitation of the Adam optimization algorithm is that it can fail to converge, even in convex optimization settings. This failure typically occurs when the second moment estimate, denoted as $$\mathbf{s}_t$$, blows up. Specifically, when the squared gradient $$\mathbf{g}_t^2$$ has high variance or when updates are sparse, the state variable $$\mathbf{s}_t$$ may forget its past values too rapidly, which destabilizes the learning process. These convergence issues can be mitigated either by increasing the minibatch size during training or by switching to an optimization algorithm that uses an improved update for $$\mathbf{s}_t$$, such as the Yogi optimizer.
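The "forgetting" behavior can be illustrated with a small scalar sketch. It assumes the Yogi update as it is usually stated, $$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2)\, \mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1})$$, and compares it with Adam's rule rewritten as $$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2)(\mathbf{g}_t^2 - \mathbf{s}_{t-1})$$. The function names and the toy gradient sequence are hypothetical and chosen only for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def adam_second_moment(s, g, beta2=0.999):
    # Adam: the step size depends on the magnitude of (g^2 - s).
    return s + (1 - beta2) * (g * g - s)


def yogi_second_moment(s, g, beta2=0.999):
    # Yogi (as assumed above): the step size depends only on g^2,
    # and the sign of (g^2 - s) decides the direction.
    g2 = g * g
    return s + (1 - beta2) * g2 * sign(g2 - s)


# Toy sparse-update scenario: one large gradient followed by many zeros.
grads = [10.0] + [0.0] * 2000

s_adam = 0.0
s_yogi = 0.0
for g in grads:
    s_adam = adam_second_moment(s_adam, g)
    s_yogi = yogi_second_moment(s_yogi, g)

print(f"Adam s after sparse run: {s_adam:.5f}")
print(f"Yogi s after sparse run: {s_yogi:.5f}")
```

In this toy run, Adam's estimate decays geometrically during the zero-gradient steps, whereas the Yogi-style estimate is left unchanged whenever the gradient is zero, so the memory of the one large gradient is retained.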

Claude

A key component of the Adam optimization algorithm is its use of exponentially weighted moving averages (also known as leaky averaging) to estimate both the momentum and the second moment of the gradient. At each time step $$t$$, it maintains two state variables: $$\mathbf{v}_t \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t$$ and $$\mathbf{s}_t \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$$, where $$\mathbf{g}_t$$ is the gradient at step $$t$$ and the square is taken elementwise. The terms $$\beta_1$$ and $$\beta_2$$ are nonnegative weighting parameters. Common default choices are $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$, which ensure that the variance estimate $$\mathbf{s}_t$$ adapts much more slowly than the momentum term $$\mathbf{v}_t$$.
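The two leaky averages can be sketched for a single scalar parameter as follows. This is a minimal illustration rather than a full optimizer; the helper name `update_moments` and the constant toy gradients are assumptions made for the example.

```python
def update_moments(v, s, g, beta1=0.9, beta2=0.999):
    """Apply one leaky-average step to both Adam state variables (scalar case)."""
    v = beta1 * v + (1 - beta1) * g      # momentum estimate v_t
    s = beta2 * s + (1 - beta2) * g * g  # second-moment estimate s_t
    return v, s


# Start from zero state and feed three identical toy gradients.
v, s = 0.0, 0.0
for g in [1.0, 1.0, 1.0]:
    v, s = update_moments(v, s, g)

print(f"v after 3 steps: {v:.6f}")
print(f"s after 3 steps: {s:.6f}")
```

With a constant gradient of 1, the state after $$t$$ steps equals $$1 - \beta^t$$ for the corresponding $$\beta$$, so $$\mathbf{v}_t$$ approaches the true value far faster than $$\mathbf{s}_t$$ does under the default settings.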

Adam State Variables

Dive into Deep Learning

In the Adam optimizer, the state variables for momentum ($$\mathbf{v}_t$$) and the second moment ($$\mathbf{s}_t$$) are typically initialized to zero ($$\mathbf{v}_0 = \mathbf{s}_0 = 0$$). This initialization introduces a significant bias towards smaller values during the initial training steps: with $$\beta_2 = 0.999$$, for example, the first update gives $$\mathbf{s}_1 = 0.001\, \mathbf{g}_1^2$$. The bias arises because the weights of the leaky average, $$(1 - \beta) \beta^i$$, sum to less than one after $$t$$ steps. Using $$\sum_{i=0}^{t-1} \beta^i = \frac{1 - \beta^t}{1 - \beta}$$, their total is $$(1 - \beta) \sum_{i=0}^{t-1} \beta^i = 1 - \beta^t$$, and Adam re-normalizes by dividing by this factor. The resulting debiased, or normalized, state variables are computed as $$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t}$$ and $$\hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}$$.
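The effect of the correction can be checked numerically with a short scalar sketch. The helper name `debiased_moments` and the constant toy gradients are assumptions made for illustration; with a constant gradient $$g$$, the debiased estimates should recover $$g$$ and $$g^2$$ up to floating-point error.

```python
def debiased_moments(grads, beta1=0.9, beta2=0.999):
    """Run the leaky averages from zero state, then apply bias correction."""
    t = len(grads)
    if t == 0:
        raise ValueError("need at least one gradient")

    v, s = 0.0, 0.0  # v_0 = s_0 = 0
    for g in grads:
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g * g

    # Divide by the total weight 1 - beta^t accumulated after t steps.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return v, s, v_hat, s_hat


# Five steps with a constant toy gradient g = 2.
v, s, v_hat, s_hat = debiased_moments([2.0] * 5)

print(f"raw:      v = {v:.4f}, s = {s:.4f}")
print(f"debiased: v_hat = {v_hat:.4f}, s_hat = {s_hat:.4f}")
```

The raw values stay well below 2 and 4 after only five steps, while the debiased values match them, which is the bias towards smaller values that the normalization removes.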
