1Cademy - AdaGrad Update Rule


AdaGrad Update Rule

The AdaGrad algorithm updates the parameters of a model by maintaining a state variable $\mathbf{s}_t$ that accumulates the element-wise squares of past gradients. At each step $t$, the gradient $\mathbf{g}_t = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w}))$ is computed at the current parameters $\mathbf{w}_{t-1}$. The state variable is initialized as $\mathbf{s}_0 = \mathbf{0}$ and updated as $\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$, where the square is taken element-wise. The parameter vector $\mathbf{w}$ is then updated coordinate-wise according to the rule $\mathbf{w}_t = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t$, where $\eta$ is the base learning rate, $\epsilon$ is a small additive constant that prevents division by zero, and the square root, division, and product are all applied element-wise. Each coordinate $i$ therefore has its own effective learning rate, $\eta / \sqrt{s_{t,i} + \epsilon}$, determined by the accumulated squared gradients for that coordinate (an uncentered sum of squares rather than a variance). Because $\mathbf{s}_t$ never decreases, this effective rate can only shrink over time, and it shrinks fastest for coordinates whose gradients have been consistently large.
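The update rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `adagrad_step`, the default values of `eta` and `eps`, and the toy quadratic objective are all assumptions made for this example and do not come from the text above.

```python
import numpy as np

def adagrad_step(w, s, grad, eta=0.1, eps=1e-6):
    """Apply one AdaGrad update and return the new (w, s).

    eta and eps defaults are illustrative assumptions, not prescribed values.
    """
    s = s + grad ** 2                      # s_t = s_{t-1} + g_t^2 (element-wise)
    w = w - eta / np.sqrt(s + eps) * grad  # w_t = w_{t-1} - eta / sqrt(s_t + eps) * g_t
    return w, s

# Hypothetical toy objective: f(w) = 0.1 * w[0]**2 + 2 * w[1]**2
def objective(w):
    return 0.1 * w[0] ** 2 + 2.0 * w[1] ** 2

def grad_fn(w):
    # Analytic gradient of the toy objective above
    return np.array([0.2 * w[0], 4.0 * w[1]])

w0 = np.array([-5.0, -2.0])
w, s = w0.copy(), np.zeros_like(w0)  # s_0 = 0
for _ in range(20):
    w, s = adagrad_step(w, s, grad_fn(w), eta=0.4)
```

In this sketch the second coordinate has the larger gradients, so its entry in `s` grows faster and its effective step size `eta / np.sqrt(s + eps)` shrinks faster than that of the first coordinate.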


Updated 2026-05-15

Contributors are:

Who are from:

University of California, Berkeley
