Although the AdaGrad algorithm maintains an auxiliary state variable $$\mathbf{s}_t$$ to give each coordinate its own learning rate, this extra bookkeeping does not significantly increase its computational cost relative to standard stochastic gradient descent (SGD). Storing $$\mathbf{s}_t$$ and updating it with element-wise arithmetic is inexpensive, because the dominant cost in optimizing deep learning models remains the forward pass to evaluate the objective function $$l(y_t, f(\mathbf{x}_t, \mathbf{w}))$$ and the backward pass to compute its derivative.


The AdaGrad algorithm updates the parameters of a model by maintaining a state variable $$\mathbf{s}_t$$ that accumulates the element-wise squares of past gradients. At each step $$t$$, the gradient $$\mathbf{g}_t = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w}))$$ is computed. The state variable is updated as $$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$$, initialized with $$\mathbf{s}_0 = \mathbf{0}$$, where the square is taken element-wise. The parameter vector $$\mathbf{w}$$ is then updated coordinate-wise according to the rule: $$\mathbf{w}_t = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t$$, where $$\eta$$ is the initial learning rate, $$\epsilon$$ is a small additive constant used to prevent division by zero, and the square root, division, and product are all element-wise. This formulation gives each coordinate its own adaptive learning rate, determined by the accumulated squared gradients of that coordinate: coordinates with consistently large gradients are scaled down more than coordinates with small or infrequent gradients.
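The update rule above can be sketched in plain Python to make the per-coordinate scaling visible. This is a minimal illustration, not a reference implementation: the helper name `adagrad_step`, the list-based representation, and the gradient values are assumptions chosen for the example.

```python
import math

def adagrad_step(w, s, g, eta, eps=1e-6):
    """One AdaGrad step on plain Python lists (hypothetical helper).

    w: parameters, s: accumulated squared gradients, g: current gradient.
    Returns the new (w, s) without modifying the inputs.
    """
    # s_t = s_{t-1} + g_t^2, element-wise
    s_new = [si + gi ** 2 for si, gi in zip(s, g)]
    # w_t = w_{t-1} - eta / sqrt(s_t + eps) * g_t, element-wise
    w_new = [wi - eta / math.sqrt(si + eps) * gi
             for wi, si, gi in zip(w, s_new, g)]
    return w_new, s_new

w = [1.0, 1.0]
s = [0.0, 0.0]          # s_0 = 0
g = [0.01, 100.0]       # assumed gradient with very different scales
w1, s1 = adagrad_step(w, s, g, eta=0.1)

# Size of the first step taken along each coordinate.
steps = [before - after for before, after in zip(w, w1)]
print(steps)
```

Because $$\mathbf{s}_1 = \mathbf{g}_1^2$$ on the first step, both coordinates move by roughly $$\eta$$ even though their gradients differ by four orders of magnitude; the small deviation in the first coordinate comes from $$\epsilon$$.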

AdaGrad Update Rule

Dive into Deep Learning

To observe the behavior of AdaGrad in a quadratic convex problem, we can apply it to the two-dimensional function $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$. In Python, the coordinate-wise update can be expressed as:

```python
import math

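def adagrad_2d(x1, x2, s1, s2, eta):
    eps = 1e-6  # small constant for numerical stability
    # Gradient of f(x) = 0.1 * x1^2 + 2 * x2^2
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate squared gradients per coordinate
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Scale each coordinate's step by its own accumulated state
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2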
```

With a moderate learning rate (e.g., $$\eta = 0.4$$), the trajectory is smooth, but the independent variables barely move in later iterations because the ever-growing state variable $$\mathbf{s}_t$$ keeps shrinking the effective learning rate. Increasing the initial learning rate to a much larger value (e.g., $$\eta = 2$$) yields much better convergence, showing that AdaGrad's learning rate decay can be quite aggressive and that the initial learning rate may require careful tuning.
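The comparison can be reproduced with a short driver loop. This is a sketch under stated assumptions: the starting point $$(-5, -2)$$, the 20 iterations, and the helper name `run` are illustrative choices, not fixed by the text above.

```python
import math

def adagrad_2d(x1, x2, s1, s2, eta):
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2          # gradient of 0.1*x1^2 + 2*x2^2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def run(eta, steps=20, x1=-5.0, x2=-2.0):
    """Run AdaGrad from an assumed start point; return the final (x1, x2)."""
    s1 = s2 = 0.0                       # state starts at zero
    for _ in range(steps):
        x1, x2, s1, s2 = adagrad_2d(x1, x2, s1, s2, eta)
    return x1, x2

small = run(eta=0.4)
large = run(eta=2.0)
print(f"eta=0.4: x1={small[0]:.6f}, x2={small[1]:.6f}")
print(f"eta=2.0: x1={large[0]:.6f}, x2={large[1]:.6f}")
```

Under these assumptions, the run with $$\eta = 0.4$$ is still far from the minimum at the origin along $$x_1$$ after 20 steps, while the run with $$\eta = 2$$ ends much closer to it.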

AdaGrad Optimization Trajectory in 2D

Computational Cost of AdaGrad

A from-scratch implementation of the AdaGrad optimization algorithm maintains an auxiliary state variable for each parameter tensor, initialized to zeros with the same shape as the parameter. During an update step, the element-wise square of the gradient is accumulated into this state. The parameter is then updated using an individually scaled learning rate: it is decremented by the product of the initial learning rate and the gradient, divided by the square root of the accumulated state plus a small constant (e.g., `1e-6`) for numerical stability. Finally, the parameter gradients are zeroed out. Because AdaGrad continuously scales down the effective learning rate through this state variable, training typically requires a larger initial learning rate than standard mini-batch stochastic gradient descent.

In PyTorch, this can be implemented as follows:
```python
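import torch

def init_adagrad_states(feature_dim):
    # One zero-initialized state tensor per parameter, matching its shape
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6  # small constant for numerical stability
    for p, s in zip(params, states):
        with torch.no_grad():
            # Accumulate element-wise squared gradients in place
            s[:] += torch.square(p.grad)
            # Per-coordinate scaled update
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        # Reset gradients so the next backward pass starts from zero
        p.grad.data.zero_()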
```
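As a usage sketch, the two functions can drive a small linear regression. Everything beyond `init_adagrad_states` and `adagrad` is an assumption made for illustration: the synthetic data, the `true_w` and `true_b` values, the helper `mse`, the learning rate of `0.1`, and the use of full-batch gradients rather than mini-batches.

```python
import torch

def init_adagrad_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] += torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()

torch.manual_seed(0)

# Assumed synthetic data: y = X @ true_w + true_b + small noise
true_w = torch.tensor([[2.0], [-3.4]])
true_b = 4.2
X = torch.randn(1000, 2)
y = X @ true_w + true_b + 0.01 * torch.randn(1000, 1)

# Parameters and their matching AdaGrad states
w = torch.zeros((2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
states = init_adagrad_states(2)
hyperparams = {'lr': 0.1}

def mse(w, b):
    # Halved mean squared error over the full data set
    return ((X @ w + b - y) ** 2).mean() / 2

initial_loss = mse(w, b).item()
for _ in range(200):
    loss = mse(w, b)
    loss.backward()                      # populates w.grad and b.grad
    adagrad((w, b), states, hyperparams) # updates params, zeroes grads
final_loss = mse(w, b).item()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The order of `params` must match the order of the tensors returned by `init_adagrad_states`, since the two are paired with `zip`.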
