Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.

Improving Generalization Performance by Switching from Adam to SGD

$ M^{t} = \beta_{1} M^{t-1} + (1 - \beta_{1}) \nabla J(W^{t}) $

$ V^{t} = \beta_{2} V^{t-1} + (1 - \beta_{2}) (\nabla J(W^{t}))^2 $

$ \hat{M}^{t} = \frac{M^{t}}{1 - (\beta_{1})^{t}}, \quad \hat{V}^{t} = \frac{V^{t}}{1 - (\beta_{2})^{t}} $

$ W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{\hat{V}^{t}} + \epsilon} \hat{M}^{t} $

------------------------------------------------------------
$1 - (\beta_{2})^{t}$ and $1 - (\beta_{1})^{t}$ are used to normalize both matrices: M and V are initialized at zero, so without the correction their early values are biased toward zero. The normalized copies $\hat{M}^{t}$ and $\hat{V}^{t}$ are used only in the weight update; the recursion keeps the unnormalized $M^{t}$ and $V^{t}$.

$M^{t}$ - helper matrix similar to the one we used for momentum; $\hat{M}^{t}$ is its normalized version.

$V^{t}$ - helper matrix similar to the one we used for RMSprop; $\hat{V}^{t}$ is its normalized version.

$\beta_{1}, \beta_{2}$ - the terms identical to the ones in momentum and RMSprop (usually $\beta_{1}=0.9, \beta_{2}=0.999$).

$W^{t}$ - the parameters

$\alpha$ - starting learning rate (usually something around 0.001).

$\epsilon$ - a small constant that avoids division by zero (usually around 1e-8).

The same principle applies to the bias parameters.
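To see why the division by $1 - (\beta_{1})^{t}$ matters, the following minimal sketch tracks the first-moment average for a constant gradient. The gradient value and the variable names are assumptions chosen only for illustration.

```python
beta1 = 0.9
grad = 2.0  # constant gradient, an arbitrary illustrative value

m = 0.0  # the running average starts at zero
raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * grad  # unnormalized running average
    m_hat = m / (1 - beta1 ** t)        # normalized estimate used in the update
    raw.append(m)
    corrected.append(m_hat)
```

In the first steps the raw average is far below the true gradient (0.2 instead of 2.0 at step 1), while the normalized value recovers the gradient at every step.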

Adam (Deep Learning Optimization Algorithm) Mathematical Implementation

```python
import numpy as np

class Adam():

    # An object of this class goes together with one layer's parameters
    def __init__(self, input_dims, nodes):

        self.beta1 = 0.9
        self.beta2 = 0.999

        # Here are the M and V matrices for the weights and biases
        self.m_weights = np.zeros((input_dims, nodes))
        self.v_weights = np.zeros((input_dims, nodes))
        self.m_biases = np.zeros(nodes)
        self.v_biases = np.zeros(nodes)

        # need to track the current time step
        self.curr_iter = 0

    def get_steps(self, grad_weights, grad_biases, learning_rate):

        eps = 1e-8
        self.curr_iter += 1

        # Follows the formula and returns the updates that have to be
        # subtracted from the weights and biases

        # Update the stored (unnormalized) running averages
        self.m_weights = self.beta1 * self.m_weights + (1 - self.beta1) * grad_weights
        self.v_weights = self.beta2 * self.v_weights + (1 - self.beta2) * np.power(grad_weights, 2)
        self.m_biases = self.beta1 * self.m_biases + (1 - self.beta1) * grad_biases
        self.v_biases = self.beta2 * self.v_biases + (1 - self.beta2) * np.power(grad_biases, 2)

        # Normalize into local copies so that the stored M and V are not
        # overwritten (overwriting them would compound the correction)
        m_corr = 1 - np.power(self.beta1, self.curr_iter)
        v_corr = 1 - np.power(self.beta2, self.curr_iter)
        m_weights_hat = self.m_weights / m_corr
        v_weights_hat = self.v_weights / v_corr
        m_biases_hat = self.m_biases / m_corr
        v_biases_hat = self.v_biases / v_corr

        weights_step = learning_rate * m_weights_hat / (np.sqrt(v_weights_hat) + eps)
        biases_step = learning_rate * m_biases_hat / (np.sqrt(v_biases_hat) + eps)

        return weights_step, biases_step
```

Adam (Deep Learning Optimization Algorithm) Python Implementation

- Adam is fast, but tends to overfit
- SGD is slow but gives great results
- RMSProp sometimes works best
- SWA can easily improve quality
- AdaTune magically improves the learning rate

Adam vs. SGD vs. RMSProp vs. SWA vs. AdaTune

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: faster convergence from momentum and a per-parameter adaptive learning rate from RMSProp.

When a neural network trains, it uses an algorithm called an optimizer to determine the optimal weights for the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

- RMSprop stands for Root Mean Square Propagation.
- RMSprop is an optimization algorithm similar to AdaGrad that aims to deal with AdaGrad's radically diminishing learning rates.

- Gradients may be tiny for some weights and huge for others, which makes it difficult to find a single global learning rate for the algorithm.
RMSprop divides each weight's gradient by the root of a running average of that weight's recent squared gradients, so the step size is set per weight rather than by the raw magnitude of the gradient. The step size adapts individually over time, so that we accelerate learning in the direction that we need. Because the running average decays exponentially, old history is discarded, and RMSProp can converge rapidly once it reaches a locally convex bowl, as if it were an instance of AdaGrad initialized in that bowl.

RMSprop (Deep Learning Optimization Algorithm)

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients, and then use that average instead of the raw gradient to update your weights. In practice it almost always works faster than the standard gradient descent algorithm.


Gradient Descent with Momentum

Here is a very helpful article on different types of optimizer algorithms
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Batch Gradient Descent requires the entire dataset to be processed to complete one step. Gradient descent can require many steps, which makes this procedure very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by taking small groups of random data points called mini-batches and using them to estimate each step. The mini-batch size is usually greater than 1 but less than N (the dataset size). Because each step processes only a fraction of the data, steps are much cheaper and the algorithm as a whole is much faster.
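As a minimal sketch of the idea, the loop below fits a single slope parameter with mini-batches of 32 points. The synthetic data, the helper name `minibatch_indices`, the batch size, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0]  # noiseless target with slope 3 (illustrative data)

w = 0.0
learning_rate = 0.1
for epoch in range(20):
    for idx in minibatch_indices(len(X), 32, rng):
        pred = w * X[idx, 0]
        # gradient of the mean squared error, estimated on this mini-batch only
        grad = 2.0 * np.mean((pred - y[idx]) * X[idx, 0])
        w -= learning_rate * grad
```

Each update touches at most 32 of the 200 points, so one pass over the data yields seven parameter updates instead of one.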

Mini-Batch Gradient Descent

Learning rate decay is the gradual reduction of the learning rate over the course of training. A large learning rate early on speeds up learning, while decaying it as gradient descent approaches a minimum reduces the noise in the updates and facilitates a tighter convergence to the target.
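One common schedule is inverse-time decay. The sketch below is only one possible choice; the function name and the constants are assumptions for illustration.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch):
    """Inverse-time decay: the rate shrinks as the epoch number grows."""
    return initial_rate / (1.0 + decay_rate * epoch)

# Learning rates for the first four epochs
rates = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(4)]
```

Other schedules (exponential decay, step-wise drops) follow the same pattern: the rate is a decreasing function of the epoch or iteration number.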

Learning Rate Decay

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. In its basic (batch) form, each step requires calculating the gradient of the loss with respect to the parameters over the entire dataset.
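A minimal sketch of the update rule on a one-dimensional toy loss; the loss function, starting point, and learning rate are assumptions chosen so the result is easy to check.

```python
def gradient_descent(grad_fn, start, learning_rate, n_steps):
    """Repeatedly step against the gradient and return the final parameter."""
    w = start
    for _ in range(n_steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss J(w) = (w - 4)^2 with gradient 2 * (w - 4); its minimum is at w = 4
w_min = gradient_descent(lambda w: 2.0 * (w - 4.0), start=0.0,
                         learning_rate=0.1, n_steps=100)
```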

Gradient Descent

Similar to RMSProp, AdaDelta (Adaptive Delta) was proposed to compensate for the shortcomings of AdaGrad. Like RMSProp, AdaDelta keeps an exponential moving average of the squared gradients (often denoted G) instead of their sum. Instead of using a fixed step size η, it uses an exponential moving average of the squared parameter changes (denoted s below), so each step is scaled by the ratio of the two averages.

$G = \gamma G + (1-\gamma)(\nabla_{\theta}J(\theta_t))^2$

$\Delta_{\theta} = \frac{\sqrt{s+\epsilon}}{\sqrt{G + \epsilon}} \cdot \nabla_{\theta}J(\theta_t)$

$\theta = \theta - \Delta_{\theta}$

$s = \gamma s + (1-\gamma) \Delta_{\theta}^2$
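The four update rules above can be sketched in NumPy as follows. The class layout and the default values of `gamma` and `eps` are assumptions, not part of the formulas.

```python
import numpy as np

class AdaDelta():

    def __init__(self, shape, gamma=0.9, eps=1e-6):
        self.gamma = gamma          # decay rate of both running averages (assumed default)
        self.eps = eps              # small constant (assumed default)
        self.G = np.zeros(shape)    # running average of squared gradients
        self.s = np.zeros(shape)    # running average of squared steps

    def get_step(self, grad):
        # Returns the step to subtract from the parameters
        self.G = self.gamma * self.G + (1 - self.gamma) * np.power(grad, 2)
        step = np.sqrt(self.s + self.eps) / np.sqrt(self.G + self.eps) * grad
        self.s = self.gamma * self.s + (1 - self.gamma) * np.power(step, 2)
        return step

# One step on the toy loss J(theta) = sum(theta^2), whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
optimizer = AdaDelta(theta.shape)
first_step = optimizer.get_step(2 * theta)
theta = theta - first_step
```

Note that no learning rate appears anywhere: the size of the step comes entirely from the ratio of the two running averages.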

AdaDelta (Deep Learning Optimization Algorithm)

Adam (Deep Learning Optimization Algorithm)

One of the big disadvantages of the momentum and Nesterov momentum algorithms is that they rely heavily on the choice of learning rate. AdaGrad is one of the algorithms that modifies the learning rate as training proceeds, separately for each parameter. The intuition behind the adaptive learning rate is that it takes smaller steps for parameters tied to frequent features and larger steps for parameters tied to features that occur rarely.
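A minimal sketch of this behaviour, using the standard AdaGrad rule (accumulate the squared gradients and divide the learning rate by the root of that sum). The class layout, default values, and the two-feature toy gradients are assumptions for illustration.

```python
import numpy as np

class AdaGrad():

    def __init__(self, shape, learning_rate=0.1, eps=1e-8):
        self.learning_rate = learning_rate
        self.eps = eps
        self.G = np.zeros(shape)  # sum of all squared gradients seen so far

    def get_step(self, grad):
        # Returns the step to subtract from the parameters
        self.G = self.G + np.power(grad, 2)
        return self.learning_rate / np.sqrt(self.G + self.eps) * grad

optimizer = AdaGrad(shape=(2,))
# Parameter 0 receives a gradient at every step (frequent feature),
# parameter 1 only at the last step (rare feature)
for _ in range(9):
    optimizer.get_step(np.array([1.0, 0.0]))
step = optimizer.get_step(np.array([1.0, 1.0]))
```

Both parameters see the same gradient on the last call, yet the rarely updated one takes a step roughly three times larger, because its accumulated sum is much smaller.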

AdaGrad (Deep Learning Optimization Algorithm)

In the momentum method we first compute the gradient at the current point and then add the momentum term (a weighted sum of all previous steps). In the Nesterov method we first move in the direction of the momentum, then calculate the gradient at that look-ahead point, and use this gradient to correct the step.

The update rules are as follows:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
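The update rules above can be sketched as follows for a single parameter. In this notation $\alpha$ is the momentum coefficient and $\epsilon$ is the learning rate; the function name, the default values, and the toy loss are assumptions for illustration.

```python
def nesterov_step(theta, v, grad_fn, alpha=0.9, eps=0.1):
    """One Nesterov update; alpha is the momentum coefficient, eps the learning rate."""
    lookahead = theta + alpha * v              # move along the momentum first
    v = alpha * v - eps * grad_fn(lookahead)   # gradient is taken at the look-ahead point
    return theta + v, v

# Toy loss L(theta) = theta^2 with gradient 2 * theta; its minimum is at 0
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda x: 2.0 * x)
```

The only difference from plain momentum is the point at which the gradient is evaluated: `lookahead` rather than `theta`.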

Nesterov momentum (Deep Learning Optimization Algorithm)

- **Local optima**: it is actually unlikely to get stuck in local optima.
- **Cliffs**: on the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact gradients**: sometimes approximation is needed for gradients.
- **Plateaus**: a low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean squared prop).
Like momentum, it smooths the gradient with an exponentially weighted average; like RMSprop, it rescales each update by a running average of the squared gradients. In effect, it takes RMSProp and applies momentum to the rescaled gradients.

Adam introduces four hyperparameters:
- learning rate alpha
- beta1 from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

You usually do not need to tune beta1, beta2, and epsilon, as the values listed above generally work well. Only the learning rate is left to tune in order to accelerate training.


Adam combines the advantages of two optimization algorithms, AdaGrad and RMSProp. It considers both the first moment estimate of the gradient (the mean of the gradient) and the second moment estimate (the uncentered variance of the gradient) when calculating the update step size.

Adam optimization algorithm


Adam is different from classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates, and unless a decay schedule is added the learning rate does not change during training. Adam combines the advantages of AdaGrad and RMSProp. Like RMSProp, it adapts the learning rate of each parameter using a running average of the second moments of the gradients (the uncentered variance), and it also replaces the raw gradient with a running average of the first moment (the mean).
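The difference is easiest to see on the very first update. The sketch below compares the SGD step with Adam's first step for three gradients of very different magnitude; the helper names and the gradient values are assumptions for illustration.

```python
import numpy as np

def sgd_step(grad, learning_rate):
    # SGD: the step is proportional to the gradient
    return learning_rate * grad

def adam_first_step(grad, learning_rate, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam at t = 1, starting from zero-initialized moment estimates
    m = (1 - beta1) * grad
    v = (1 - beta2) * np.power(grad, 2)
    m_hat = m / (1 - beta1)   # normalized first moment
    v_hat = v / (1 - beta2)   # normalized second moment
    return learning_rate * m_hat / (np.sqrt(v_hat) + eps)

grads = np.array([0.001, 1.0, 1000.0])
sgd = sgd_step(grads, 0.01)
adam = adam_first_step(grads, 0.01)
```

The SGD steps span six orders of magnitude, while every Adam step is approximately equal to the learning rate, because the gradient is divided by its own magnitude.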

Difference between Adam and SGD

$ G^{t} = \beta G^{t-1} + (1 - \beta) (\nabla J(W^{t}))^2 $

$ W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{G^{t} + \epsilon}} \nabla J(W^{t}) $

-----------------------------------------
The same principle applies to the bias parameters.

$G^{t}$ - helper matrix for the algorithm (a running average of the squared gradients)

$\beta$ - the term that controls how fast old values in the matrix G decay (usually around 0.9)

$W^{t}$ - the parameters

$\alpha$ - starting learning rate (usually something around 0.1 or 0.01)

$\epsilon$ - a small constant that avoids division by zero (usually around 1e-8)

RMSprop (Deep Learning Optimization Algorithm) Mathematical Implementations

```python
import numpy as np

class RMSprop():

    # An object of this class goes together with one layer's parameters
    def __init__(self, input_dims, nodes):

        self.learning_rate = 0.01
        self.beta = 0.9

        # G matrix for the weights
        self.G_weights = np.zeros((input_dims, nodes))

        # G matrix for the biases
        self.G_biases = np.zeros(nodes)


    # Takes the gradients of the weights and biases.
    # Returns the updates that we need to subtract from
    # the current weights and biases
    def get_steps(self, grad_weights, grad_biases):

        eps = 1e-8

        # updating the G matrices
        self.G_weights = self.beta * self.G_weights + (1 - self.beta) * np.power(grad_weights, 2)
        self.G_biases = self.beta * self.G_biases + (1 - self.beta) * np.power(grad_biases, 2)

        weights_step = self.learning_rate / np.sqrt(self.G_weights + eps) * grad_weights
        biases_step = self.learning_rate / np.sqrt(self.G_biases + eps) * grad_biases

        return weights_step, biases_step
```

RMSprop (Deep Learning Optimization Algorithm) Python implementation

On iteration t:
         Compute dW, db on the current mini-batch
                $S_{dW} = \beta S_{dW} + (1-\beta) dW^2$ (the square is element-wise)
                $S_{db} = \beta S_{db} + (1-\beta) db^2$
                $W := W - \alpha \frac{dW}{\sqrt{S_{dW} + \epsilon}}, b := b - \alpha \frac{db}{\sqrt{S_{db} + \epsilon}}$


RMSprop (Deep Learning Optimization Algorithm) Pseudocode

As shown in the picture below, with the plain gradient descent optimizer the path oscillates up and down in the vertical direction while it keeps moving right in the horizontal direction. Averaging the previous few gradients decreases the oscillations in the vertical direction, because the positive and negative values cancel out. And since all gradients point in the same direction horizontally, the average in the horizontal direction remains large and keeps pointing the right way.

Intuition behind Gradient Descent with Momentum

These plots were generated with gradient descent, with gradient descent with momentum (β = 0.5), and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

On iteration t:
         Compute dW, db on the current mini-batch
                $v_{dW} = \beta v_{dW} + (1-\beta)dW$
                $v_{db} = \beta v_{db} + (1-\beta)db$
                $W = W - \alpha v_{dW}, b = b - \alpha v_{db}$
Note that now we have two hyperparameters, $\alpha$ and $\beta$.
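The pseudocode above can be sketched in NumPy as follows. The class layout, the default values, and the alternating toy gradients are assumptions for illustration.

```python
import numpy as np

class Momentum():

    def __init__(self, shape, learning_rate=0.1, beta=0.9):
        self.learning_rate = learning_rate  # alpha in the pseudocode
        self.beta = beta
        self.v = np.zeros(shape)            # exponentially weighted average of gradients

    def get_step(self, grad):
        # Returns the step to subtract from the parameters
        self.v = self.beta * self.v + (1 - self.beta) * grad
        return self.learning_rate * self.v

optimizer = Momentum(shape=(2,))
# Horizontal gradient is always 1; vertical gradient flips between +5 and -5
for t in range(50):
    vertical = 5.0 if t % 2 == 0 else -5.0
    optimizer.get_step(np.array([1.0, vertical]))
v = optimizer.v
```

After 50 iterations the horizontal component of `v` is close to 1, while the vertical component has shrunk to a small fraction of the raw gradient magnitude of 5, which is the damping of oscillations described above.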
