Formula

AdaGrad Update Rule

The AdaGrad algorithm updates the parameters of a model by maintaining a state variable $\mathbf{s}_t$ that accumulates the element-wise squares of past gradients. At each step $t$, the gradient $\mathbf{g}_t = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w}))$ is computed. The state variable is updated as $\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$, initialized with $\mathbf{s}_0 = \mathbf{0}$, where the square is taken element-wise. The parameter vector $\mathbf{w}$ is then updated coordinate-wise according to the rule $\mathbf{w}_t = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t$, where $\eta$ is the learning rate and $\epsilon$ is a small additive constant used to prevent division by zero. This formulation gives each coordinate its own adaptive learning rate, scaled by the accumulated squared magnitude of its past gradients: coordinates with large or frequent gradients take smaller steps, and the effective learning rate $\eta / \sqrt{\mathbf{s}_t + \epsilon}$ never increases over time.
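The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `adagrad_step` and `grad`, the toy quadratic objective, and the default values of `eta` and `eps` are all assumptions chosen for the example.

```python
import math


def adagrad_step(w, s, g, eta=0.1, eps=1e-6):
    """Apply one AdaGrad update to lists of floats; return the new (w, s)."""
    # Accumulate element-wise squared gradients: s_t = s_{t-1} + g_t^2
    s_new = [s_i + g_i * g_i for s_i, g_i in zip(s, g)]
    # Per-coordinate step: w_t = w_{t-1} - eta / sqrt(s_t + eps) * g_t
    w_new = [
        w_i - eta / math.sqrt(s_i + eps) * g_i
        for w_i, s_i, g_i in zip(w, s_new, g)
    ]
    return w_new, s_new


def grad(w):
    # Gradient of the hypothetical toy objective f(w) = 0.1 * w[0]**2 + 2 * w[1]**2
    return [0.2 * w[0], 4.0 * w[1]]


w = [-5.0, -2.0]
s = [0.0, 0.0]  # s_0 = 0
for _ in range(100):
    w, s = adagrad_step(w, s, grad(w), eta=0.4)
```

Because the gradient is divided by the square root of its own accumulated squares, the very first step has magnitude close to `eta` in every coordinate regardless of the gradient's scale. Later steps shrink as `s` grows.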

Updated 2026-05-15

Tags

Deep Learning (in Machine learning)

Data Science

D2L

Dive into Deep Learning @ D2L

Computing Sciences