ADADELTA: An Adaptive Learning Rate Method


The Adadelta algorithm updates parameters using a sequence of operations based on leaky averages. Given a decay parameter $$\rho$$, the state variable for the gradient's second moment is updated as $$\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2$$. A rescaled gradient $$\mathbf{g}_t'$$ is then computed using the ratio of the root mean square of previous parameter changes to the root mean square of the gradients: $$\mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{{\mathbf{s}_t + \epsilon}}} \odot \mathbf{g}_t$$. The model parameters are updated by subtracting this rescaled gradient: $$\mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t'$$. Finally, the state variable tracking the parameter changes, initialized at $$\Delta \mathbf{x}_0 = 0$$, is updated as $$\Delta \mathbf{x}_t = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2$$, where $$\epsilon$$ is a small constant (e.g., $$10^{-5}$$) added to maintain numerical stability.
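The four update equations above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `adadelta_step`, the default `rho=0.9`, and the quadratic test function are assumptions made for the example, while `eps=1e-5` follows the value mentioned in the text.

```python
import math


def adadelta_step(x, grad, s, delta, rho=0.9, eps=1e-5):
    """Apply one Adadelta update to a list of parameters.

    `s` and `delta` hold the leaky averages of squared gradients and of
    squared parameter changes. rho=0.9 is an assumed default.
    """
    new_x, new_s, new_delta = [], [], []
    for x_i, g_i, s_i, d_i in zip(x, grad, s, delta):
        # Leaky average of the squared gradient.
        s_i = rho * s_i + (1 - rho) * g_i ** 2
        # Rescale the gradient by RMS(previous changes) / RMS(gradients).
        g_prime = math.sqrt(d_i + eps) / math.sqrt(s_i + eps) * g_i
        # Update the parameter, then the leaky average of squared changes.
        x_i = x_i - g_prime
        d_i = rho * d_i + (1 - rho) * g_prime ** 2
        new_x.append(x_i)
        new_s.append(s_i)
        new_delta.append(d_i)
    return new_x, new_s, new_delta


def f(x):
    # Illustrative quadratic objective (an assumption for this sketch).
    return 0.1 * x[0] ** 2 + 2 * x[1] ** 2


x, s, delta = [-5.0, -2.0], [0.0, 0.0], [0.0, 0.0]
f_start = f(x)
for _ in range(50):
    grad = [0.2 * x[0], 4 * x[1]]
    x, s, delta = adadelta_step(x, grad, s, delta)
```

Note that no learning rate appears anywhere in the step: the numerator built from `delta` plays that role, which is why the first updates are very small when `delta` starts at zero.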

Adadelta Update Rule

Adadelta is an optimization algorithm that has no explicit learning rate parameter. Instead, it uses the rate of change in the parameters themselves to dynamically adapt the learning rate. To accomplish this, the algorithm utilizes two specific state variables: $$\mathbf{s}_t$$ to track a leaky average of the second moment of the gradient, and $$\Delta\mathbf{x}_t$$ to track a leaky average of the second moment of the model's parameter changes. The algorithm retains standard naming conventions for these variables to maintain consistency with similar optimization methods like momentum, AdaGrad, and RMSProp.

Claude

University of Michigan - Ann Arbor

When a neural network trains, it uses an algorithm, called an optimizer, to determine the optimal weights for the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

- RMSProp stands for Root Mean Square Propagation.
- RMSProp is an optimization algorithm closely related to AdaGrad, as both employ the square of the gradient to scale the update coefficients on a per-coordinate basis. However, RMSProp overcomes AdaGrad's tendency for radically diminishing learning rates by using a leaky (exponentially weighted) average of squared gradients rather than a cumulative sum.
- RMSProp also shares the leaky averaging mechanism with the momentum method, but applies it differently: whereas momentum uses leaky averaging to smooth the gradient direction, RMSProp uses the technique to adjust the coefficient-wise preconditioner that rescales the learning rate independently for each parameter.
- Because RMSProp does not automatically schedule the learning rate (unlike AdaGrad, whose effective learning rate decays implicitly through accumulation), the practitioner must schedule the learning rate explicitly.
- The decay coefficient $$\gamma$$ governs how long the gradient history is retained when adjusting the per-coordinate scale: a larger $$\gamma$$ produces a longer memory, while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients.

RMSprop (Deep Learning Optimization Algorithm)

Here is a very helpful article on different types of optimizer algorithms
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Dive into Deep Learning

Batch Gradient Descent requires the entire dataset to be processed to complete one step. Gradient Descent can require many steps in some cases which causes this procedure to be very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by drawing small random subsets of the data, called mini-batches, and using each one to estimate the gradient for a step. The mini-batch size is usually greater than 1 but less than N (the dataset size). Because each step processes only a few examples, every step is much cheaper to compute, which makes the overall algorithm much faster.
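As a rough sketch of how mini-batches might be drawn, the generator below shuffles the indices once and then yields consecutive slices. The helper name `minibatches`, the `seed` argument, and the toy dataset are illustrative assumptions, not part of the original notes.

```python
import random


def minibatches(data, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover `data` exactly once."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


# Toy dataset of N = 10 examples split into batches of size 4.
batches = list(minibatches(list(range(10)), batch_size=4))
```

With N = 10 and a batch size of 4, one pass over the data takes three steps, and the last batch is smaller than the others.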

Mini-Batch Gradient Descent

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients, and then use that average, instead of the raw gradient, to update your weights. It almost always works faster than the standard gradient descent algorithm.
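A minimal sketch of this idea on a one-dimensional problem is shown below. It assumes the convention in which the average is weighted by `(1 - beta)`; other formulations omit that factor. The name `momentum_step`, the defaults `lr=0.1` and `beta=0.9`, and the objective `f(w) = w**2` are assumptions for illustration.

```python
def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    """One gradient-descent-with-momentum update for a scalar weight."""
    # Exponentially weighted average of the gradients.
    v = beta * v + (1 - beta) * grad
    # Update the weight with the averaged gradient instead of the raw one.
    w = w - lr * v
    return w, v


w, v = 5.0, 0.0
for _ in range(200):
    # Gradient of the illustrative objective f(w) = w**2 is 2 * w.
    w, v = momentum_step(w, 2 * w, v)
```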


Gradient Descent with Momentum

Learning rate decay is the gradual reduction of the learning rate as a function of time to speed up the learning algorithm. Decaying the learning rate as the gradient descent approaches completion reduces noise and facilitates a tighter convergence to a target.
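One common way to realize such a schedule is inverse-time decay, sketched below. The specific formula, the function name `decayed_lr`, and the sample values are assumptions chosen for illustration; many other schedules (step, exponential, cosine) are used in practice.

```python
def decayed_lr(lr0, decay_rate, epoch):
    """Inverse-time decay: the learning rate shrinks as the epoch count grows."""
    return lr0 / (1 + decay_rate * epoch)


# Learning rate for the first four epochs, starting from 0.2.
schedule = [decayed_lr(0.2, 1.0, epoch) for epoch in range(4)]
```

Here the learning rate halves after the first epoch and keeps shrinking more slowly afterwards, so later steps are smaller and less noisy.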

Learning Rate Decay

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.
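The loop below is a small sketch of this procedure for a scalar parameter. The function name `gradient_descent`, the learning rate, and the objective `f(x) = (x - 3)**2` are hypothetical choices made only to show the update `x <- x - lr * gradient`.

```python
def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move opposite to the direction of steepest ascent.
        x = x - lr * grad_fn(x)
    return x


# Gradient of f(x) = (x - 3)**2 is 2 * (x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```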

Gradient Descent

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: an adaptive learning rate and faster convergence through momentum.

Adam (Deep Learning Optimization Algorithm)

In the momentum method, we first move the weights in the direction of the current gradient and then in the direction of the momentum (a weighted sum of all previous steps). In Nesterov momentum, we first move in the direction of the momentum and then calculate the gradient at that new point. Using this gradient, we then move in the direction of the new gradient.

The update rules are as follows:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
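The two update rules can be sketched for a scalar parameter as follows. In this sketch `alpha` is the momentum coefficient and `lr` stands for the step size written as $$\epsilon$$ in the equations; the function name, the default values, and the objective `f(theta) = theta**2` are assumptions for illustration.

```python
def nesterov_step(theta, v, grad_fn, alpha=0.9, lr=0.1):
    """One Nesterov momentum update for a scalar parameter."""
    # Look ahead in the direction of the momentum first.
    lookahead = theta + alpha * v
    # Evaluate the gradient at the look-ahead point, then update the velocity.
    v = alpha * v - lr * grad_fn(lookahead)
    # Move the parameter by the new velocity.
    return theta + v, v


theta, v = 5.0, 0.0
for _ in range(200):
    # Gradient of the illustrative objective f(theta) = theta**2.
    theta, v = nesterov_step(theta, v, lambda th: 2 * th)
```

The only difference from plain momentum is where the gradient is evaluated: at `theta + alpha * v` rather than at `theta`.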

Nesterov momentum (Deep Learning Optimization Algorithm)

Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean square propagation).
Like momentum alone, Adam smooths the gradient: in effect, it takes RMSProp and applies momentum to the rescaled gradients. This approach is best explained mathematically.

Adam introduces four hyperparameters:
- learning rate alpha
- beta from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

As mentioned above, you usually do not need to tune beta, beta2, and epsilon as the values listed above will generally work well. Only the learning rate is left to tune in order to accelerate training.
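To make the role of the four hyperparameters concrete, here is a minimal scalar sketch of one Adam step. It follows the commonly published form of the algorithm, including the bias-correction terms, which the notes above do not spell out; the function name `adam_step` and the default `lr=0.001` are assumptions for illustration.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    # First moment: leaky average of the gradient (the momentum part).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: leaky average of the squared gradient (the RMSprop part).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=2.0, m=m, v=v, t=1)
```

On the very first step the corrected moments reduce to the gradient and its square, so the update has magnitude close to `lr` regardless of the gradient's scale.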


Adam combines the advantages of two optimization algorithms, AdaGrad and RMSProp. It takes into account both the first moment estimate of the gradient (the mean of the gradient) and the second moment estimate (the uncentered variance of the gradient), and uses them to calculate the update step size.

Adam optimization algorithm


The Adagrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing simple feature occurrence counters with an aggregate of the squares of previously observed gradients. Specifically, it uses $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ to adjust the learning rate. This automatically scales down the step size significantly for coordinates that frequently have large gradients, while applying a gentler treatment to coordinates with small gradients, thereby eliminating the need to manually decide when a gradient is considered large enough.
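A small sketch of the accumulation is given below. The state update matches the formula above; the parameter update `x <- x - lr / sqrt(s + eps) * g` is the usual AdaGrad form and is assumed here, as are the function name `adagrad_step`, the defaults, and the quadratic test function.

```python
import math


def adagrad_step(x, grad, s, lr=0.4, eps=1e-6):
    """One AdaGrad update for a list of coordinates."""
    new_x, new_s = [], []
    for x_i, g_i, s_i in zip(x, grad, s):
        # Accumulate the squared gradient; this sum never decreases.
        s_i = s_i + g_i ** 2
        # Coordinates with large accumulated gradients take smaller steps.
        x_i = x_i - lr / math.sqrt(s_i + eps) * g_i
        new_x.append(x_i)
        new_s.append(s_i)
    return new_x, new_s


def f(x):
    # Illustrative quadratic objective (an assumption for this sketch).
    return 0.1 * x[0] ** 2 + 2 * x[1] ** 2


x, s = [-5.0, -2.0], [0.0, 0.0]
f_start = f(x)
for _ in range(20):
    x, s = adagrad_step(x, [0.2 * x[0], 4 * x[1]], s)
```

Because `s` only grows, the effective step size `lr / sqrt(s + eps)` shrinks monotonically for every coordinate.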

Adagrad

Adadelta

Optimization algorithms in deep learning face several challenges:
- **Local Optima**: It is actually unlikely to get stuck in local optima.
- **Cliffs**: On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact Gradients**: Sometimes approximation is needed for gradients when the exact gradient is intractable.
- **Plateaus**: A low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

```python
import numpy as np


class RMSprop():

    # An object of this class is paired with one layer's parameters
    def __init__(self, input_dims, nodes):

        self.learning_rate = 0.01
        self.beta = 0.9

        # G matrix for the weights
        self.G_weights = np.zeros((input_dims, nodes))

        # G vector for the biases
        self.G_biases = np.zeros(nodes)

    # Takes the gradients of the weights and biases.
    # Returns the steps that we need to subtract from
    # the current weights and biases
    def get_steps(self, grad_weights, grad_biases):

        eps = 1e-8

        # Update the G accumulators (leaky averages of squared gradients)
        self.G_weights = self.beta * self.G_weights + (1 - self.beta) * np.power(grad_weights, 2)
        self.G_biases = self.beta * self.G_biases + (1 - self.beta) * np.power(grad_biases, 2)

        # Scale each gradient element by learning_rate / sqrt(G + eps)
        weights_step = np.multiply((self.learning_rate / np.sqrt(self.G_weights + eps)), grad_weights)
        biases_step = np.multiply((self.learning_rate / np.sqrt(self.G_biases + eps)), grad_biases)

        return weights_step, biases_step
```

RMSprop (Deep Learning Optimization Algorithm) Python implementation

- Adam is fast, but tends to overfit
- SGD is slow but gives great results
- RMSProp sometimes works best
- SWA can easily improve quality
- AdaTune automatically tunes the learning rate during training

Adam vs. SGD vs. RMSProp vs. SWA vs. AdaTune

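On iteration $$t$$:
- Compute $$dW$$ and $$db$$ on the current mini-batch.
- $$S_{dW} = \beta S_{dW} + (1-\beta) dW^2$$
- $$S_{db} = \beta S_{db} + (1-\beta) db^2$$
- $$W := W - \alpha \frac{dW}{\sqrt{S_{dW}}}, \quad b := b - \alpha \frac{db}{\sqrt{S_{db}}}$$

The squares and divisions are element-wise. In practice, a small constant $$\epsilon$$ is added to the denominator to avoid division by zero.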


RMSprop (Deep Learning Optimization Algorithm) Pseudocode

To visualize RMSProp's convergence behavior, the algorithm is applied to the two-dimensional quadratic function $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$ with a learning rate of $$\eta = 0.4$$ and decay parameter $$\gamma = 0.9$$. The coordinate-wise implementation computes gradients $$g_1 = 0.2 x_1$$ and $$g_2 = 4 x_2$$, updates the leaky averages of squared gradients as $$s_i = \gamma s_i + (1 - \gamma) g_i^2$$, and adjusts each coordinate by $$x_i \leftarrow x_i - \frac{\eta}{\sqrt{s_i + \epsilon}} g_i$$. After 20 epochs, the variables converge near the origin ($$x_1 \approx -0.0106$$, $$x_2 \approx 0$$). Unlike AdaGrad, which stalls in later iterations because the learning rate decreases too quickly, RMSProp maintains effective progress throughout training because $$\eta$$ is controlled independently from the state variable rescaling.

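```python
import math

eta, gamma = 0.4, 0.9  # learning rate and decay parameter

def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f(x) = 0.1 * x1**2 + 2 * x2**2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    # Leaky averages of the squared gradients
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    # Coordinate-wise rescaled updates
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2
```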
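The step function by itself does not produce a trajectory, so the sketch below adds a small driver loop. It repeats the step function so that it runs on its own. The starting point `(-5, -2)` with zero-initialized states is an assumption (the text does not state the initial values), and the `trajectory` list is a hypothetical stand-in for whatever plotting helper is used.

```python
import math

eta, gamma = 0.4, 0.9  # values quoted in the text


def rmsprop_2d(x1, x2, s1, s2):
    # Same update as described above for f(x) = 0.1 * x1**2 + 2 * x2**2.
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2


# Assumed starting point and zero-initialized state variables.
x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
trajectory = [(x1, x2)]
for _ in range(20):
    x1, x2, s1, s2 = rmsprop_2d(x1, x2, s1, s2)
    trajectory.append((x1, x2))

print(f"epoch 20, x1: {x1:f}, x2: {x2:f}")
```

Under these assumptions, the 20 recorded steps end close to the origin, in line with the values reported in the text.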

RMSProp Optimization Trajectory in 2D

A from-scratch implementation of the RMSProp optimizer for deep networks requires maintaining an auxiliary state variable for each parameter tensor, initialized to zeros with the same shape. During each update step, the state is updated as a leaky average of squared gradients: $$\mathbf{s} \leftarrow \gamma \mathbf{s} + (1 - \gamma) \mathbf{g}^2$$, where $$\gamma$$ is the decay factor and $$\mathbf{g}$$ is the current gradient. The parameter is then decremented by the learning rate times the gradient divided by the square root of the state plus a numerical stability constant ($$\epsilon = 10^{-6}$$). Finally, the parameter gradients are zeroed out.

In PyTorch, this can be implemented as follows:

```python
import torch

def init_rmsprop_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Leaky average of squared gradients, updated in place
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            # Rescale the gradient by the root of the state
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()
```


RMSProp Optimizer From-Scratch Implementation

In the RMSProp optimization algorithm, the effective observation window for the exponentially weighted average of squared gradients is defined by the quantity $$\frac{1}{1 - \gamma}$$, where $$\gamma$$ is the weighting term (or decay factor). This means the state variable aggregates information over approximately the past $$\frac{1}{1 - \gamma}$$ observations. A larger $$\gamma$$ produces a longer memory and a smoother average, while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients. For example, setting $$\gamma = 0.9$$ yields an effective window of $$\frac{1}{1 - 0.9} = 10$$ observations.
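The window can be checked numerically by expanding the leaky average: the observation from $$k$$ steps back receives weight $$(1 - \gamma)\gamma^k$$. The short sketch below computes these weights; the helper name `leaky_weights` is an assumption made for illustration.

```python
def leaky_weights(gamma, n):
    """Weight given to the observation k steps back, for k = 0..n-1."""
    return [(1 - gamma) * gamma ** k for k in range(n)]


gamma = 0.9
window = 1 / (1 - gamma)  # effective observation window
# Total weight carried by the most recent 10 observations: 1 - gamma**10.
recent_mass = sum(leaky_weights(gamma, 10))
```

For $$\gamma = 0.9$$, the ten most recent observations carry roughly 65% of the total weight, so the window is a rule of thumb rather than a hard cutoff.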

Effective Observation Window of RMSProp

The RMSProp update rule maintains a state variable $$G^{t}$$ that tracks the exponentially weighted average of squared gradients, and uses it to adaptively scale the learning rate:

$$G^{t} = \beta G^{t-1} + (1 - \beta) (\nabla J(W^{t-1}))^2$$

$$W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{G^{t} + \epsilon}} \nabla J(W^{t-1})$$

The same principle applies to the bias parameters.

- $$G^{t}$$: the helper (state) matrix holding the exponentially weighted average of squared gradients, with the same shape as $$W^{t}$$
- $$\beta$$: the decay factor controlling how quickly the running average forgets old observations (typically around $$0.9$$)
- $$W^{t}$$: the model parameters
- $$\alpha$$: the initial learning rate (typically around $$0.01$$)
- $$\epsilon$$: a small constant to prevent division by zero (typically around $$10^{-6}$$ or $$10^{-8}$$)
