Formula

Adam State Variables

A key component of the Adam optimization algorithm is its use of exponentially weighted moving averages, also known as leaky averaging, to estimate both the momentum and the second moment of the gradient. At each time step $t$, it maintains two state variables: $\mathbf{v}_t \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t$ and $\mathbf{s}_t \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$, where $\mathbf{g}_t$ is the gradient at step $t$ and the square is taken elementwise. The terms $\beta_1$ and $\beta_2$ are nonnegative weighting parameters. Common default choices are $\beta_1 = 0.9$ and $\beta_2 = 0.999$, which ensure that the variance estimate $\mathbf{s}_t$ adapts much more slowly than the momentum term $\mathbf{v}_t$.
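The two state updates can be sketched in a few lines of plain Python. This is a minimal illustration, not a full Adam implementation: the function name `update_adam_state`, the use of Python lists for vectors, and the zero initialization of both states are assumptions made for the example. It covers only the two leaky averages described above, not the parameter update itself.

```python
def update_adam_state(v, s, g, beta1=0.9, beta2=0.999):
    """Apply one step of both leaky averages to lists of floats.

    Hypothetical helper for illustration; v, s, and g must have equal length.
    """
    # Momentum estimate: v_t = beta1 * v_{t-1} + (1 - beta1) * g_t
    v_new = [beta1 * vi + (1.0 - beta1) * gi for vi, gi in zip(v, g)]
    # Second-moment estimate: s_t = beta2 * s_{t-1} + (1 - beta2) * g_t^2
    # (the square is elementwise)
    s_new = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
    return v_new, s_new


# Assumed initialization: both states start at zero.
v, s = [0.0, 0.0], [0.0, 0.0]

# Feed the same gradient three times to see how quickly each state adapts.
for g in ([1.0, -2.0], [1.0, -2.0], [1.0, -2.0]):
    v, s = update_adam_state(v, s, g)

print("v:", v)
print("s:", s)
```

With a constant gradient, each state after $t$ steps equals $(1 - \beta^t)$ times its limiting value, so after three steps $\mathbf{v}_t$ has covered about 27% of the distance to $\mathbf{g}_t$, while $\mathbf{s}_t$ has covered only about 0.3% of the distance to $\mathbf{g}_t^2$. This is the slower adaptation of the second-moment estimate under the default settings.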

Updated 2026-05-16

Tags

D2L

Dive into Deep Learning @ D2L