Concept

Mathematical Implementation

G^{t} = G^{t-1} + \left( \nabla J(W^{t-1}) \right)^{2}

W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{G^{t} + \epsilon}} \nabla J(W^{t-1})

G^{t} - the running sum of squared gradients (the algorithm's helper matrix)

W^{t} - the parameters

\alpha - the starting learning rate (usually something around 0.1 or 0.01)

\epsilon - a small constant to avoid division by zero (usually around 1e-8)

The same principle applies to the bias parameters
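
A minimal NumPy sketch of the two update rules above, applied to both weights and biases. The function name adagrad_update and the dummy gradients are hypothetical placeholders for illustration; in practice the gradients come from backpropagation.

```python
import numpy as np

def adagrad_update(W, G, grad, alpha=0.01, eps=1e-8):
    """One step of the update rule for a parameter array W.

    W    -- current parameters
    G    -- running sum of squared gradients (same shape as W)
    grad -- gradient of the cost J with respect to W
    """
    G = G + grad ** 2                        # G^t = G^{t-1} + (nabla J(W^{t-1}))^2
    W = W - alpha / np.sqrt(G + eps) * grad  # per-parameter scaled step
    return W, G

# Hypothetical usage: the same rule is applied to weights and biases.
W = np.random.randn(3, 2)
b = np.zeros(2)
G_W, G_b = np.zeros_like(W), np.zeros_like(b)
for step in range(100):
    grad_W = np.random.randn(*W.shape)  # stand-in for a real gradient
    grad_b = np.random.randn(*b.shape)
    W, G_W = adagrad_update(W, G_W, grad_W, alpha=0.1)
    b, G_b = adagrad_update(b, G_b, grad_b, alpha=0.1)
```

Because G^{t} only grows, the effective step size alpha / sqrt(G^{t} + epsilon) shrinks over time, and it shrinks fastest for parameters that receive large gradients.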


Updated 2020-11-16

Tags

Deep Learning (in Machine learning)

Data Science