$G^{t} = G^{t-1} + \left(\nabla J(W^{t-1})\right)^2$

$W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{G^{t} + \epsilon}} \nabla J(W^{t-1})$

-----------------------------------------
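$G^{t}$ - helper matrix for the algorithm: the running sum of the element-wise squared gradients, with the same shape as $W$

$W^{t}$ - the parameters (weights) at step $t$

$\alpha$ - starting learning rate (usually something around 0.1 or 0.01)

$\epsilon$ - a small constant that avoids division by zero (usually around 1e-8)

The square, square root, and division are applied element-wise. The same principle applies to the bias parameters.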

Mathematical Implementation

Pros:

 - The learning rate needs little manual tuning, because the effective rate for each parameter adapts during training

Cons:

 - The sum of squared gradients only grows, so it can become very large; the effective learning rate then shrinks toward zero and the model stops learning
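
The drawback above can be seen in a small numeric sketch. It assumes a single parameter whose gradient is a constant 1.0 at every step (a made-up input chosen only for illustration); `adagrad_effective_rates` is a hypothetical helper name, not part of any library.

```python
import numpy as np


def adagrad_effective_rates(grads, alpha=0.1, eps=1e-8):
    """Return alpha / sqrt(G + eps) after each gradient in `grads`."""
    G = 0.0
    rates = []
    for g in grads:
        G += g ** 2  # the sum of squared gradients never decreases
        rates.append(alpha / np.sqrt(G + eps))
    return rates


# Assumed input: the same gradient of 1.0 for 100 consecutive steps.
rates = adagrad_effective_rates([1.0] * 100)
print(rates[0], rates[-1])
```

With these inputs the effective rate falls roughly as $\alpha / \sqrt{t}$, so it keeps shrinking for as long as training continues.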

Pros and Cons

Below is an AdaGrad class that shows one way to implement the algorithm in Python. The class takes the number of nodes in one layer and that layer's input dimensions. The method `get_steps` takes the current gradients for the weights and biases and returns the steps used to update those weights and biases.

```python
import numpy as np


class AdaGrad:
    """AdaGrad step calculator for one dense layer."""

    def __init__(self, input_dims, nodes, learning_rate=0.01, eps=1e-8):
        self.learning_rate = learning_rate
        self.eps = eps  # avoids division by zero

        # G matrix for the weights: running sum of squared gradients
        self.G_weights = np.zeros((input_dims, nodes))

        # G vector for the biases
        self.G_biases = np.zeros(nodes)

    def get_steps(self, grad_weights, grad_biases):
        # Update the G matrices with the element-wise squared gradients
        self.G_weights += np.square(grad_weights)
        self.G_biases += np.square(grad_biases)

        # Scale each gradient by its own effective learning rate
        weights_step = self.learning_rate / np.sqrt(self.G_weights + self.eps) * grad_weights
        biases_step = self.learning_rate / np.sqrt(self.G_biases + self.eps) * grad_biases

        # The caller subtracts these steps from the weights and biases
        return weights_step, biases_step
```

Python Implementation

A big disadvantage of the momentum and Nesterov momentum algorithms is that they rely heavily on the choice of learning rate. AdaGrad is one of the algorithms that modifies the learning rate as training proceeds. The intuition behind the adaptive learning rate is that parameters tied to frequent features, which accumulate a large sum of squared gradients, take smaller steps, while parameters tied to rarely occurring features take larger steps.
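
The frequent-versus-rare intuition can be illustrated with a small sketch. The gradient pattern is an assumption made up for the example: parameter 0 receives a gradient on every step, parameter 1 only on every 10th step.

```python
import numpy as np

alpha, eps = 0.1, 1e-8
G = np.zeros(2)  # accumulated squared gradients for [frequent, rare]

for t in range(100):
    # Hypothetical gradients: the "rare" parameter is only active
    # when t is a multiple of 10.
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    G += grad ** 2

effective_rate = alpha / np.sqrt(G + eps)
print(effective_rate)  # the rare parameter keeps the larger rate
```

After 100 steps the frequent parameter has accumulated ten times more squared gradient than the rare one, so its effective learning rate is the smaller of the two.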

When a neural network trains, it uses an algorithm called an optimizer to determine the optimal weights for the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

Here is a very helpful article on different types of optimizer algorithms
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Batch Gradient Descent requires the entire dataset to be processed to complete one step. Because gradient descent can require many steps, this procedure is very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by taking small groups of random data points called mini-batches and using them to estimate the gradient for each step. The mini-batch size is usually greater than 1 but less than N (the dataset size). Because each step is much cheaper to compute, the algorithm makes progress much faster.
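
As a rough illustration of the mini-batch idea, here is a minimal sketch for linear regression with a mean squared error loss. The model, the synthetic noise-free data, and the hyperparameter values are all assumptions chosen for the example; `minibatch_gd` is a hypothetical function name.

```python
import numpy as np


def minibatch_gd(X, y, batch_size=16, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w by mini-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient estimated from this mini-batch only
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w


# Synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = minibatch_gd(X, y)
print(w)
```

Each update here touches 16 rows instead of all 200, which is where the speed-up over batch gradient descent comes from.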

Mini-Batch Gradient Descent

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of the gradients and then use that average, instead of the raw gradient, to update the weights. It almost always works faster than the standard gradient descent algorithm.
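
A minimal sketch of this update, assuming the averaging form $v \leftarrow \beta v + (1-\beta)\nabla J$ with $\beta = 0.9$ and a toy loss $J(w) = \lVert w \rVert^2$. The function name, the hyperparameter values, and the loss are illustrative assumptions.

```python
import numpy as np


def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One gradient descent step using an exponentially weighted average."""
    v = beta * v + (1.0 - beta) * grad  # smoothed gradient
    w = w - lr * v                      # update with the average, not the raw gradient
    return w, v


# Toy problem (assumed): J(w) = ||w||^2, whose gradient is 2w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(300):
    w, v = momentum_step(w, v, 2.0 * w)
print(w)
```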


Gradient Descent with Momentum

Learning rate decay is the gradual reduction of the learning rate over the course of training. Larger steps early on speed up learning, while decaying the learning rate as gradient descent approaches completion reduces noise and allows tighter convergence to the target.
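
The text does not say which schedule is used, so the sketch below assumes one common choice, inverse-time decay, where the rate at a given epoch is `initial_lr / (1 + decay_rate * epoch)`. The function name and the numbers are illustrative only.

```python
def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Inverse-time decay (one possible schedule among many)."""
    return initial_lr / (1.0 + decay_rate * epoch)


# Assumed values: start at 0.1 and decay with rate 0.5 per epoch
schedule = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(5)]
print(schedule)
```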

Learning Rate Decay

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.
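
The loop described above can be sketched in a few lines. The quadratic loss, the learning rate, and the names `gradient_descent` and `grad_J` are assumptions made for the example.

```python
import numpy as np


def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Repeatedly move the parameters against the gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad_fn(w)  # step opposite to the direction of steepest ascent
    return w


def grad_J(w):
    # Gradient of the assumed loss J(w) = (w[0] - 3)^2 + (w[1] + 1)^2
    return 2.0 * (w - np.array([3.0, -1.0]))


w_min = gradient_descent(grad_J, [0.0, 0.0])
print(w_min)  # approaches the minimizer [3, -1]
```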

Gradient Descent

Similar to RMSProp, AdaDelta (Adaptive Delta) is a method proposed to compensate for the shortcomings of AdaGrad. Like RMSProp, AdaDelta accumulates the squared gradients (often denoted $G$) as an exponentially decaying average instead of a sum. Instead of a fixed step size η, it scales each update by an exponentially decaying average of the squared parameter updates, denoted $s$; $\gamma$ is the decay rate of both averages.

$G = \gamma G + (1-\gamma)(\nabla_{\theta}J(\theta_t))^2$

$\Delta_{\theta} = \frac{\sqrt{s+\epsilon}}{\sqrt{G + \epsilon}} \cdot \nabla_{\theta}J(\theta_t)$

$\theta = \theta - \Delta_{\theta}$

$s = \gamma s + (1-\gamma) \Delta_{\theta}^2$
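
The four update rules translate almost line by line into code. This is a minimal sketch: the class name, the defaults `gamma=0.9` and `eps=1e-6`, and the toy loss $J(\theta) = \lVert\theta\rVert^2$ are assumptions, since the text does not give values.

```python
import numpy as np


class AdaDelta:
    def __init__(self, shape, gamma=0.9, eps=1e-6):
        self.gamma = gamma
        self.eps = eps
        self.G = np.zeros(shape)  # decaying average of squared gradients
        self.s = np.zeros(shape)  # decaying average of squared updates

    def step(self, theta, grad):
        self.G = self.gamma * self.G + (1 - self.gamma) * grad ** 2
        delta = np.sqrt(self.s + self.eps) / np.sqrt(self.G + self.eps) * grad
        theta = theta - delta
        self.s = self.gamma * self.s + (1 - self.gamma) * delta ** 2
        return theta


# Toy problem (assumed): J(theta) = ||theta||^2, gradient 2 * theta
start = np.array([1.0, -2.0])
theta = start.copy()
optimizer = AdaDelta(theta.shape)
for _ in range(100):
    theta = optimizer.step(theta, 2.0 * theta)
print(theta)
```

Note that no learning rate appears anywhere: the ratio of the two running averages sets the step size.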

AdaDelta (Deep Learning Optimization Algorithm)

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: an adaptive learning rate and faster convergence with momentum.

Adam (Deep Learning Optimization Algorithm)

- Stands for Root Mean Square Propagation
- RMSprop is a mini-batch learning algorithm similar to AdaGrad that aims to deal with AdaGrad's radically diminishing learning rates.

- Some gradients may be tiny and others huge, which makes it difficult to find a single global learning rate for the algorithm.
RMSprop divides the gradient for each weight by the root of an exponentially decaying average of that weight's recent squared gradients, so the update depends on an adapted per-weight step size rather than on the raw magnitude of the gradient. The step size adapts individually over time, so that learning accelerates in the directions that need it. Because old gradients are gradually forgotten, RMSProp behaves like an instance of AdaGrad initialized inside a locally convex bowl, which lets it converge rapidly there.
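
A minimal sketch of one RMSprop update, using commonly quoted defaults (`lr=0.01`, `decay=0.9`, `eps=1e-8`) that are assumptions here rather than values from the text. The two gradients in the example differ by a factor of 1000 to show how the rescaling evens out the step sizes.

```python
import numpy as np


def rmsprop_step(w, avg_sq, grad, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update; avg_sq is the running average of squared gradients."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / np.sqrt(avg_sq + eps)
    return w, avg_sq


# Assumed gradients: one huge, one small
w = np.zeros(2)
avg_sq = np.zeros(2)
w, avg_sq = rmsprop_step(w, avg_sq, np.array([100.0, 0.1]))
print(w)  # both weights move by nearly the same amount
```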

RMSprop (Deep Learning Optimization Algorithm)

AdaGrad (Deep Learning Optimization Algorithm)

In the momentum method, we first compute the gradient at the current position and then move in the combined direction of that gradient and the momentum (a weighted sum of all previous steps). In Nesterov momentum, we first move in the direction of the momentum and then calculate the gradient at that new look-ahead point, and we use this gradient for the update.

The update rules are as follows, where $v$ is the velocity, $\alpha$ is the momentum coefficient, $\epsilon$ is the learning rate, and $m$ is the mini-batch size:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
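
The two rules can be sketched as follows, keeping the symbols from the equations ($\alpha$ for the momentum coefficient, $\epsilon$ for the learning rate). The function name, the default values, and the toy loss $J(\theta) = \lVert\theta\rVert^2$ are assumptions for illustration; `grad_fn` stands in for the mini-batch gradient.

```python
import numpy as np


def nesterov_step(theta, v, grad_fn, alpha=0.9, epsilon=0.05):
    """One Nesterov momentum update."""
    lookahead = theta + alpha * v                     # move along the momentum first
    v = alpha * v - epsilon * grad_fn(lookahead)      # gradient at the look-ahead point
    theta = theta + v
    return theta, v


def grad_J(theta):
    # Gradient of the assumed loss J(theta) = ||theta||^2
    return 2.0 * theta


theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_J)
print(theta)
```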

Nesterov momentum (Deep Learning Optimization Algorithm)

- **Local optima**: it's actually unlikely to get stuck in local optima.
- **Cliffs**: on the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact Gradients**: sometimes approximation is needed for gradients.
- **Plateaus**: low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean square propagation).
Like momentum, it smooths the gradient with an exponentially weighted average, and like RMSprop, it rescales each update using a running average of the squared gradients. This approach is best explained mathematically.

Adam introduces four hyperparameters:
- learning rate alpha
- beta from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

As mentioned above, you usually do not need to tune beta, beta2, and epsilon as the values listed above will generally work well. Only the learning rate is left to tune in order to accelerate training.
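
A minimal sketch of an Adam update using the four hyperparameters listed above. The bias-correction terms come from the standard formulation of Adam and are not spelled out in the text; the class name and the default `alpha=0.001` are likewise assumptions.

```python
import numpy as np


class Adam:
    def __init__(self, shape, alpha=0.001, beta=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta, self.beta2, self.eps = alpha, beta, beta2, eps
        self.m = np.zeros(shape)  # first moment: running mean of gradients
        self.v = np.zeros(shape)  # second moment: running mean of squared gradients
        self.t = 0                # step counter for bias correction

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta * self.m + (1 - self.beta) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        # Correct the bias caused by initializing m and v at zero
        m_hat = self.m / (1 - self.beta ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)


# Toy problem (assumed): J(w) = ||w||^2, gradient 2w
optimizer = Adam(shape=2)
w = np.array([1.0, -2.0])
w_new = optimizer.step(w, 2.0 * w)
print(w_new)
```

On the very first step the update has magnitude close to `alpha` for every parameter, regardless of how large each gradient is.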


Adam combines the advantages of two optimization algorithms, AdaGrad and RMSProp. It considers both the first moment estimate of the gradient (the mean of the gradient) and the second moment estimate (the uncentered variance of the gradient) when calculating the update step size.

Adam optimization algorithm


Adam is different from classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates, and that learning rate does not change during training. Adam combines the advantages of AdaGrad and RMSProp. It adapts the per-parameter learning rates using the average of the second moments of the gradients (the uncentered variance), as in RMSProp, and it also makes use of the average of the first moment (the mean), as in momentum.
