Similar to other optimization algorithms, the AdaGrad optimizer can be implemented concisely using high-level APIs in modern deep learning frameworks. Instead of manually maintaining auxiliary state variables and writing the coordinate-wise update logic from scratch, developers can directly instantiate built-in optimizer classes. For instance, in PyTorch, this is achieved by invoking torch.optim.Adagrad; in MXNet's Gluon API, by specifying the algorithm as 'adagrad'; and in TensorFlow, by using tf.keras.optimizers.Adagrad. These built-in implementations handle the internal accumulation of squared gradients and numerical stability adjustments automatically, allowing for streamlined model training.

Claude

The AdaGrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing simple feature occurrence counters with an aggregate of the squares of previously observed gradients. Specifically, for each coordinate $$i$$ it accumulates $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ and scales that coordinate's learning rate in proportion to $$1/\sqrt{s(i, t)}$$. This automatically scales down the step size significantly for coordinates that frequently have large gradients, while applying a gentler treatment to coordinates with small or infrequent gradients, thereby eliminating the need to manually decide when a gradient is considered large enough.
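As a rough illustration of the sparse-feature effect, the sketch below accumulates squared gradients for two hypothetical coordinates: one that receives a gradient at every step and one that receives a gradient only every tenth step. The gradient values, the step count, and the constants `eta` and `eps` are assumptions chosen for illustration, and the effective step size is assumed to take the form `eta / sqrt(s + eps)`.

```python
import math

eta, eps = 0.1, 1e-6   # illustrative constants, not taken from the text
s_frequent = 0.0       # coordinate with a nonzero gradient at every step
s_rare = 0.0           # coordinate with a nonzero gradient every 10th step

for t in range(1, 101):
    g_frequent = 1.0
    g_rare = 1.0 if t % 10 == 0 else 0.0
    # Each coordinate accumulates only its own squared gradients.
    s_frequent += g_frequent ** 2
    s_rare += g_rare ** 2

# Effective per-coordinate step sizes after 100 steps.
lr_frequent = eta / math.sqrt(s_frequent + eps)
lr_rare = eta / math.sqrt(s_rare + eps)
print(lr_frequent, lr_rare)
```

Under these assumptions the rarely updated coordinate keeps a larger effective step size, because its accumulator has grown ten times more slowly.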

Adagrad

Dive into Deep Learning

The AdaGrad algorithm updates the parameters of a model by maintaining a state variable $$\mathbf{s}_t$$ that accumulates the element-wise squares of past gradients. At each step $$t$$, the gradient $$\mathbf{g}_t = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w}))$$ is computed. The state variable is initialized with $$\mathbf{s}_0 = \mathbf{0}$$ and updated as $$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$$. The parameter vector $$\mathbf{w}$$ is then updated coordinate-wise according to the rule $$\mathbf{w}_t = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t$$, where the square, square root, division, and product are all applied element-wise, $$\eta$$ is the initial learning rate, and $$\epsilon$$ is a small additive constant that prevents division by zero. Each coordinate therefore has its own adaptive learning rate, determined by the accumulated squared magnitude of its past gradients (an uncentered sum, not a variance).
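The update rule can be sketched in plain Python for parameter vectors stored as lists. The function name `adagrad_step`, its default values, and the example gradient are hypothetical and chosen only for illustration. This is a minimal sketch of the rule as written above, not a reference implementation.

```python
import math

def adagrad_step(w, s, grad, eta=0.1, eps=1e-6):
    """Return updated (w, s) after one AdaGrad step; inputs are plain lists."""
    # s_t = s_{t-1} + g_t^2, element-wise
    new_s = [s_i + g_i ** 2 for s_i, g_i in zip(s, grad)]
    # w_t = w_{t-1} - eta / sqrt(s_t + eps) * g_t, element-wise
    new_w = [
        w_i - eta / math.sqrt(s_i + eps) * g_i
        for w_i, s_i, g_i in zip(w, new_s, grad)
    ]
    return new_w, new_s

w = [1.0, 1.0]
s = [0.0, 0.0]          # s_0 = 0
grad = [100.0, 0.01]    # gradients of very different magnitude
w, s = adagrad_step(w, s, grad, eta=0.1)
print(w)  # both coordinates move by roughly eta on the first step
```

Because $$\mathbf{s}_1 = \mathbf{g}_1^2$$ on the first step, each coordinate moves by about $$\eta$$ regardless of the scale of its gradient, which is one way to see the per-coordinate normalization at work.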

AdaGrad Update Rule

To observe the behavior of AdaGrad on a convex quadratic problem, we can apply it to the two-dimensional function $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$, whose gradient is $$(0.2 x_1, 4 x_2)$$. In Python, the coordinate-wise update can be expressed as:

```python
import math

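def adagrad_2d(x1, x2, s1, s2, eta):
    """Apply one AdaGrad step to f(x) = 0.1 * x1**2 + 2 * x2**2."""
    eps = 1e-6  # small constant that keeps the denominator nonzero
    # Gradient of f with respect to (x1, x2).
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate the squared gradient of each coordinate.
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Scale each coordinate's step by its own accumulated history.
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2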
```

With a learning rate of $$\eta = 0.4$$, the trajectory is smooth, but because the state variable $$\mathbf{s}_t$$ only grows, the effective learning rate shrinks at every step and the independent variables move very little in the later iterations. Increasing the initial learning rate to $$\eta = 2$$ yields much better convergence on this problem. This suggests that AdaGrad's learning rate decrease can be quite aggressive even in a noise-free setting, so the initial learning rate still needs to be chosen with care.
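The comparison can be reproduced with a short driver loop. The starting point `(-5, -2)`, the 20-step budget, and the helper names `run_adagrad` and `f` are assumptions made for this sketch. The update step repeats the rule shown earlier so that the block runs on its own.

```python
import math

def adagrad_2d(x1, x2, s1, s2, eta, eps=1e-6):
    """One AdaGrad step on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def run_adagrad(eta, steps=20, start=(-5.0, -2.0)):
    """Return the list of iterates, beginning at the assumed point `start`."""
    x1, x2 = start
    s1 = s2 = 0.0
    path = [(x1, x2)]
    for _ in range(steps):
        x1, x2, s1, s2 = adagrad_2d(x1, x2, s1, s2, eta)
        path.append((x1, x2))
    return path

def f(x1, x2):
    """Objective value, used to compare the final iterates."""
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

for eta in (0.4, 2.0):
    x1, x2 = run_adagrad(eta)[-1]
    print(f"eta={eta}: x1={x1:.6f}, x2={x2:.6f}, f={f(x1, x2):.6f}")
```

Under these assumptions, the run with $$\eta = 2$$ ends much closer to the minimum at the origin than the run with $$\eta = 0.4$$, which still has a sizeable $$x_1$$ component after 20 steps.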

AdaGrad Optimization Trajectory in 2D

Concise AdaGrad Implementation

A key characteristic of AdaGrad is that it accumulates squared gradients in a state variable $$\mathbf{s}_t$$. When gradient magnitudes stay roughly constant, this sum grows approximately linearly in $$t$$, so the effective per-coordinate learning rate decays as $$\mathcal{O}(t^{-\frac{1}{2}})$$. This schedule is perfectly adequate for convex optimization problems, but it is often too aggressive for deep learning applications. In training deep neural networks, the continuous decay can shrink the learning rate prematurely, so that the independent variables move very little during the later stages of iteration and learning may effectively stall before a good solution is reached.
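The $$\mathcal{O}(t^{-\frac{1}{2}})$$ rate can be checked numerically in the simplest case, where one coordinate sees the same gradient magnitude at every step so that $$s_t = t g^2$$. The function name `effective_lr` and its default values are hypothetical. Real training gradients vary from step to step, so this is only a sketch of the idealized case.

```python
import math

def effective_lr(t, g=1.0, eta=0.4, eps=1e-6):
    """Effective learning rate after t steps of a constant gradient g."""
    s_t = t * g ** 2  # sum of t identical squared gradients
    return eta / math.sqrt(s_t + eps)

for t in (1, 4, 100, 10_000):
    print(t, effective_lr(t))
```

Quadrupling the number of steps roughly halves the effective learning rate, and after 100 steps it is about one tenth of its value after the first step.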

Aggressive Learning Rate Decay in AdaGrad

**Pros:**
- **Automatic parameter-wise scaling:** AdaGrad removes the need to hand-tune a separate learning rate for each coordinate by dynamically adjusting it based on the history of observed gradients. The initial learning rate $$\eta$$ still has to be chosen.
- **Sparse feature robustness:** It applies a gentler learning rate decay to coordinates with infrequent gradients, allowing for larger updates on sparse features.

**Cons:**
- **Aggressive learning rate decay:** Because the algorithm continually accumulates squared gradients, the denominator grows monotonically, causing the coordinate-wise learning rate to decay at a rate of $$\mathcal{O}(t^{-\frac{1}{2}})$$. In deep learning, this decay can be too aggressive, causing the learning rate to approach zero and prematurely halt the learning process.
