The AdaGrad algorithm updates the parameters of a model by maintaining a state variable $$\mathbf{s}_t$$ that accumulates the element-wise squares of past gradients. At each step $$t$$, the gradient $$\mathbf{g}_t = \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w}))$$ is computed. The state variable is updated as $$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$$, initialized with $$\mathbf{s}_0 = \mathbf{0}$$. The parameter vector $$\mathbf{w}$$ is then updated coordinate-wise according to the rule $$\mathbf{w}_t = \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t$$, where $$\eta$$ is the initial learning rate, $$\epsilon$$ is a small additive constant used to prevent division by zero, and the square, square root, division, and product are all applied element-wise. This formulation gives each coordinate its own adaptive learning rate, scaled by the accumulated magnitude of its past gradients.

AdaGrad Update Rule

To observe the behavior of AdaGrad in a quadratic convex problem, we can apply it to the two-dimensional function $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$. In Python, the coordinate-wise update can be expressed as:

```python
import math

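def adagrad_2d(x1, x2, s1, s2, eta):
    """One AdaGrad step on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    eps = 1e-6  # small constant that keeps the denominator nonzero
    # Analytic gradient of f with respect to x1 and x2.
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate squared gradients separately for each coordinate.
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Scale each coordinate's step by its own accumulated history.
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2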
```

When optimized with a standard learning rate (e.g., $$\eta = 0.4$$), the trajectory is initially smooth, but the independent variables stop moving early due to the cumulative effect of the state variable $$\mathbf{s}_t$$ continuously decaying the learning rate. Increasing the initial learning rate to a much larger value (e.g., $$\eta = 2$$) yields better convergence behavior, demonstrating that AdaGrad's learning rate decrease can be quite aggressive and may require careful hyperparameter selection.
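
The comparison between the two learning rates can be reproduced with a short driver loop. This is a minimal sketch: the starting point `(-5, -2)`, the 20-step budget, and the helper name `run` are assumptions chosen for illustration, not values fixed by the text above.

```python
import math

def adagrad_2d(x1, x2, s1, s2, eta):
    """One AdaGrad step on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def run(eta, steps=20, start=(-5.0, -2.0)):
    """Run AdaGrad from an assumed starting point and return the final (x1, x2)."""
    x1, x2 = start
    s1 = s2 = 0.0
    for _ in range(steps):
        x1, x2, s1, s2 = adagrad_2d(x1, x2, s1, s2, eta)
    return x1, x2

small = run(eta=0.4)  # progress stalls well before the minimum at (0, 0)
large = run(eta=2.0)  # the larger initial rate compensates for the decay
print("eta=0.4:", small)
print("eta=2.0:", large)
```

Under these assumptions, the run with $$\eta = 2$$ should end much closer to the minimum at the origin than the run with $$\eta = 0.4$$.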

AdaGrad Optimization Trajectory in 2D

Similar to other optimization algorithms, the AdaGrad optimizer can be implemented concisely using high-level APIs in modern deep learning frameworks. Instead of manually maintaining auxiliary state variables and writing the coordinate-wise update logic from scratch, developers can directly instantiate built-in optimizer classes. For instance, in PyTorch, this is achieved by invoking torch.optim.Adagrad; in MXNet's Gluon API, by specifying the algorithm as 'adagrad'; and in TensorFlow, by using tf.keras.optimizers.Adagrad. These built-in implementations handle the internal accumulation of squared gradients and numerical stability adjustments automatically, allowing for streamlined model training.

Concise AdaGrad Implementation

A key characteristic of AdaGrad is that it accumulates squared gradients in a state variable $$\mathbf{s}_t$$. Because this sum continuously grows at an approximately linear rate, the effective per-coordinate learning rate decays at a rate of $$\mathcal{O}(t^{-\frac{1}{2}})$$. While this rapid decay ensures convergence and is perfectly adequate for convex optimization problems, it is often too aggressive for deep learning applications. In training deep neural networks, this continuous decay can cause the learning rate to diminish prematurely, resulting in the independent variables moving very little during the later stages of iteration and potentially halting learning before an optimal solution is reached.
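
To make the $$\mathcal{O}(t^{-\frac{1}{2}})$$ decay concrete, the sketch below tracks the effective learning rate $$\eta / \sqrt{s_t + \epsilon}$$ for a single coordinate. It assumes, purely for illustration, that the gradient magnitude stays constant at 1, so that $$s_t = t$$; the function name `effective_lr` is hypothetical.

```python
import math

def effective_lr(eta, grads, eps=1e-6):
    """Yield AdaGrad's effective rate eta / sqrt(s_t + eps) for one coordinate."""
    s = 0.0
    for g in grads:
        s += g ** 2
        yield eta / math.sqrt(s + eps)

# Assumption: constant gradient magnitude of 1.0, so s_t = t and the
# effective rate is approximately eta / sqrt(t).
rates = list(effective_lr(eta=0.1, grads=[1.0] * 10_000))
for t in (1, 100, 10_000):
    print(t, rates[t - 1])
```

In this setting the effective rate drops by a factor of about 10 for every 100-fold increase in the step count.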

Aggressive Learning Rate Decay in AdaGrad

**Pros:**
- **Automatic parameter-wise scaling:** AdaGrad reduces the need to hand-tune per-coordinate learning rates by dynamically adjusting the rate for each coordinate based on the history of observed gradients. The initial learning rate $$\eta$$ still has to be chosen.
- **Sparse feature robustness:** It applies a gentler learning rate decay to coordinates with infrequent gradients, allowing for larger updates on sparse features.

**Cons:**
- **Aggressive learning rate decay:** Because the algorithm continually accumulates squared gradients, the denominator grows monotonically, causing the coordinate-wise learning rate to decay at a rate of $$\mathcal{O}(t^{-\frac{1}{2}})$$. In deep learning, this decay can be too aggressive, causing the learning rate to approach zero and prematurely halt the learning process.

Pros and Cons of AdaGrad

The AdaGrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing simple feature occurrence counters with an aggregate of the squares of previously observed gradients. Specifically, it uses $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ to adjust the learning rate. This automatically scales down the step size significantly for coordinates that frequently have large gradients, while applying a gentler treatment to coordinates with small gradients, thereby eliminating the need to manually decide when a gradient is considered large enough.

University of California, Berkeley

Claude

University of Michigan - Ann Arbor

When a neural network trains, it uses an algorithm called an optimizer to determine the weights of the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

To address the learning rate dilemma for sparse features, one approach is to adjust the learning rate based on feature occurrence. Instead of a global time-based decay $$\eta = \frac{\eta_0}{\sqrt{t + c}}$$, a feature-specific rate $$\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$$ can be used, where $$s(i, t)$$ counts the number of nonzeros for feature $$i$$ observed up to time $$t$$. However, this method fails for data that is not strictly sparse but instead has gradients that are mostly very small and only rarely large, as it is difficult to define a clear threshold for counting a feature as observed.
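
The count-based rate $$\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$$ can be sketched in a few lines. The function name `count_based_rates`, the toy samples, and the values of `eta0` and `c` are all assumptions made for illustration.

```python
import math

def count_based_rates(samples, eta0=0.1, c=1.0):
    """Return per-feature rates eta0 / sqrt(s(i, t) + c), where s(i, t) counts nonzeros."""
    counts = [0] * len(samples[0])
    for x in samples:
        for i, value in enumerate(x):
            if value != 0:  # a feature counts as "observed" only when it is nonzero
                counts[i] += 1
    return [eta0 / math.sqrt(s + c) for s in counts]

# Hypothetical data: feature 0 is dense, feature 1 is nonzero only once.
samples = [[1.0, 0.0], [2.0, 0.0], [0.5, 3.0], [1.5, 0.0]]
rates = count_based_rates(samples)
print(rates)
```

The rarely seen feature keeps a larger learning rate. The hard `value != 0` test is exactly the threshold that becomes difficult to define when values are small but not strictly zero.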

Feature Count-Based Learning Rate Adjustment

Here is a very helpful article on different types of optimizer algorithms:
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Dive into Deep Learning

Batch gradient descent must process the entire dataset to complete a single step. Because gradient descent can require many steps, this makes the procedure slow and inefficient for large datasets.

Mini-batch gradient descent addresses this by drawing small random groups of data points, called mini-batches, and using each one to estimate the gradient for a step. The mini-batch size is usually greater than 1 but less than N (the dataset size). Because each step touches only a fraction of the data, steps are much cheaper and the algorithm runs much faster.
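
As a minimal sketch of the idea, the loop below fits a one-parameter linear model with mini-batches. The model `y ~ w * x`, the noiseless toy dataset, and all hyperparameter values are assumptions chosen for illustration.

```python
import random

def minibatch_gd(data, batch_size=4, lr=0.05, epochs=50, seed=0):
    """Fit y ~ w * x with mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # draw mini-batches in a fresh random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean of 0.5 * (w * x - y)**2 over this mini-batch only.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# Hypothetical noiseless dataset generated from y = 3 * x.
data = [(x / 10, 3 * x / 10) for x in range(1, 21)]
w = minibatch_gd(data)
print(w)
```

Each update here uses 4 of the 20 points, so one pass over the data yields five parameter updates instead of one.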

Mini-Batch Gradient Descent

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of the gradients and then use that average, rather than the raw gradient, to update the weights. It almost always works faster than the standard gradient descent algorithm.
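
A minimal sketch of this update is shown below. The objective $$0.1 x_1^2 + 2 x_2^2$$, the starting point, and the values of `lr` and `beta` are assumptions for illustration.

```python
def momentum_gd(grad_fn, x, lr=1.0, beta=0.9, steps=300):
    """Gradient descent that steps along an exponentially weighted average of gradients."""
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        # Exponentially weighted average of past gradients.
        v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
        # Update with the averaged gradient instead of the raw one.
        x = [xi - lr * vi for xi, vi in zip(x, v)]
    return x

def grad_f(x):
    # Gradient of the assumed objective 0.1 * x1**2 + 2 * x2**2.
    return [0.2 * x[0], 4 * x[1]]

x_final = momentum_gd(grad_f, [-5.0, -2.0])
print(x_final)
```

With `beta = 0.9`, each update reflects roughly the last ten gradients, which damps oscillation along the steep coordinate.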


Gradient Descent with Momentum

Learning rate decay is the gradual reduction of the learning rate as a function of time, used to speed up the learning algorithm: larger early steps make fast progress, while decaying the learning rate as gradient descent approaches completion reduces noise in the updates and allows tighter convergence to the target.
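
One simple schedule is the inverse square root decay $$\eta_t = \frac{\eta_0}{\sqrt{t + c}}$$. The sketch below is only one of many possible schedules; the function name and the values of `eta0` and `c` are assumptions.

```python
import math

def inverse_sqrt_decay(eta0, t, c=1.0):
    """Learning rate at step t under eta0 / sqrt(t + c); c is an assumed offset."""
    return eta0 / math.sqrt(t + c)

schedule = [inverse_sqrt_decay(0.5, t) for t in range(100)]
print(schedule[0], schedule[9], schedule[99])
```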

Learning Rate Decay

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss, provided the step size is small enough. Each step of such gradient-based optimization algorithms requires calculating the gradient of the loss with respect to the parameters, or an estimate of it when the exact gradient is too expensive.
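
The basic loop can be sketched as follows. The objective $$0.1 x_1^2 + 2 x_2^2$$, the starting point, and the learning rate are assumptions for illustration.

```python
def gradient_descent(grad_fn, x, lr=0.4, steps=100):
    """Repeatedly step against the gradient to lower the objective."""
    for _ in range(steps):
        g = grad_fn(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def grad_f(x):
    # Gradient of the assumed objective 0.1 * x1**2 + 2 * x2**2.
    return [0.2 * x[0], 4 * x[1]]

x_min = gradient_descent(grad_f, [-5.0, -2.0])
print(x_min)
```

With one shared learning rate, the flat coordinate `x1` converges far more slowly than the steep coordinate `x2`. Per-coordinate methods such as AdaGrad target this imbalance.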

Gradient Descent

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: a per-coordinate adaptive learning rate and the faster convergence of momentum.

Adam (Deep Learning Optimization Algorithm)

- Stands for Root Mean Square Propagation
- RMSProp is an optimization algorithm closely related to AdaGrad, as both employ the square of the gradient to scale the update coefficients on a per-coordinate basis. However, RMSProp overcomes AdaGrad's tendency for radically diminishing learning rates by using a leaky (exponentially weighted) average of squared gradients rather than a cumulative sum.
- RMSProp also shares the leaky averaging mechanism with the momentum method, but applies it differently: whereas momentum uses leaky averaging to smooth the gradient direction, RMSProp uses the technique to adjust the coefficient-wise preconditioner that rescales the learning rate independently for each parameter.
- Because RMSProp does not automatically schedule the learning rate (unlike AdaGrad, whose learning rate decays implicitly through accumulation), the practitioner must schedule the learning rate explicitly.
- The decay coefficient $$\gamma$$ governs how long the gradient history is retained when adjusting the per-coordinate scale: a larger $$\gamma$$ produces a longer memory, while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients.
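
The points above can be sketched as a coordinate-wise update on the two-dimensional function $$0.1 x_1^2 + 2 x_2^2$$. The starting point, the 20-step budget, and the values of `eta`, `gamma`, and `eps` are assumptions for illustration.

```python
import math

def rmsprop_2d(x1, x2, s1, s2, eta=0.4, gamma=0.9, eps=1e-6):
    """One RMSProp step on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    g1, g2 = 0.2 * x1, 4 * x2
    # Leaky average of squared gradients instead of AdaGrad's cumulative sum.
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
for _ in range(20):
    x1, x2, s1, s2 = rmsprop_2d(x1, x2, s1, s2)
print(x1, x2)
```

Because old squared gradients are forgotten at rate `gamma`, the denominator does not grow without bound, so the effective learning rate does not shrink toward zero the way it does in AdaGrad.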

RMSprop (Deep Learning Optimization Algorithm)

In the standard momentum method, we first compute the gradient at the current position and then add the momentum term (a weighted sum of all previous steps). In Nesterov momentum, we first move in the direction of the momentum, compute the gradient at that look-ahead point, and then use this gradient to make the update.

The update rules are as follows:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
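
The two update rules translate directly into code. In this sketch, `alpha` is the momentum coefficient and `epsilon` the learning rate, matching the symbols above; `grad_fn` stands in for the mini-batch average gradient, and the quadratic objective and starting point are assumptions for illustration.

```python
def nesterov_step(theta, v, grad_fn, alpha=0.9, epsilon=0.1):
    """One Nesterov momentum step following the two update rules."""
    # Evaluate the gradient at the look-ahead point theta + alpha * v.
    lookahead = [t + alpha * vi for t, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    # v <- alpha * v - epsilon * gradient, then theta <- theta + v.
    v = [alpha * vi - epsilon * gi for vi, gi in zip(v, g)]
    theta = [t + vi for t, vi in zip(theta, v)]
    return theta, v

def grad_f(x):
    # Stand-in for the mini-batch average gradient: exact gradient of
    # the assumed objective 0.1 * x1**2 + 2 * x2**2.
    return [0.2 * x[0], 4 * x[1]]

theta, v = [-5.0, -2.0], [0.0, 0.0]
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_f)
print(theta)
```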

Nesterov momentum (Deep Learning Optimization Algorithm)

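Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSProp (root mean square propagation).
Like momentum, it smooths the gradient with a leaky average; like RMSProp, it rescales each coordinate by a leaky average of squared gradients. With gradient $$g_t$$ at step $$t$$ and both averages initialized to zero, the update is:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Here $$\beta_1$$ and $$\beta_2$$ are the momentum and RMSProp coefficients (beta and beta2 in the list that follows), and the division by $$1 - \beta^t$$ corrects the bias caused by starting both averages at zero.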

Adam introduces four hyperparameters:
- learning rate alpha
- beta from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

You usually do not need to tune beta, beta2, and epsilon, as the values listed above generally work well. Only the learning rate is left to tune in order to accelerate training.


Adam combines the advantages of two optimization algorithms, AdaGrad and RMSProp. It takes into account both the first moment estimate of the gradient (its mean) and the second moment estimate (its uncentered variance) when calculating the update step size.
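
A minimal scalar sketch of one Adam step, using the default hyperparameter values listed above, is shown below. The toy objective $$\theta^2$$, the starting value, the step count, and the learning rate of 0.1 in the driver loop are assumptions for illustration.

```python
import math

def adam_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * g       # first moment: leaky mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: leaky mean of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Assumed toy problem: minimize theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t, alpha=0.1)
print(theta)
```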

Adam optimization algorithm


Adagrad

Adadelta is an optimization algorithm that has no explicit learning rate parameter. Instead, it uses the rate of change in the parameters themselves to dynamically adapt the learning rate. To accomplish this, the algorithm utilizes two specific state variables: $$\mathbf{s}_t$$ to track a leaky average of the second moment of the gradient, and $$\Delta\mathbf{x}_t$$ to track a leaky average of the second moment of the model's parameter changes. The algorithm retains standard naming conventions for these variables to maintain consistency with similar optimization methods like momentum, AdaGrad, and RMSProp.
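
The interaction of the two state variables can be sketched for a single scalar parameter. The decay coefficient `rho`, the constant `eps`, the toy objective $$x^2$$, and the step count are assumptions for illustration; `delta` plays the role of $$\Delta\mathbf{x}_t$$.

```python
import math

def adadelta_step(x, s, delta, g, rho=0.9, eps=1e-5):
    """One Adadelta update for a scalar parameter; note there is no learning rate."""
    # Leaky average of the squared gradient.
    s = rho * s + (1 - rho) * g ** 2
    # Rescale the gradient by the ratio of past update size to gradient size.
    g_rescaled = math.sqrt(delta + eps) / math.sqrt(s + eps) * g
    x -= g_rescaled
    # Leaky average of the squared parameter change.
    delta = rho * delta + (1 - rho) * g_rescaled ** 2
    return x, s, delta

# Single step from x = 1.0 on the assumed objective x**2 (gradient 2 * x).
x_one, s_one, delta_one = adadelta_step(1.0, 0.0, 0.0, 2.0)

# A short trajectory on the same objective.
x, s, delta = 1.0, 0.0, 0.0
for _ in range(50):
    x, s, delta = adadelta_step(x, s, delta, 2 * x)
print(x_one, x)
```

Early steps are small because `delta` starts at zero, so the first updates are driven mainly by `eps`.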

Adadelta

Optimization algorithms in deep learning face several challenges:
- **Local Optima**: In high-dimensional models, getting stuck in a poor local optimum is actually unlikely; saddle points and plateaus are the more common obstacles.
- **Cliffs**: On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact Gradients**: Sometimes approximation is needed for gradients when the exact gradient is intractable.
- **Plateaus**: A low cost function slope (close to flat) makes learning slow.
