- Adam is fast, but tends to overfit
- SGD is slow but gives great results
- RMSProp sometimes works best
- SWA can easily improve quality
- AdaTune automatically tunes the learning rate during training

University of Michigan - Ann Arbor

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: faster convergence from momentum and a per-parameter adaptive learning rate from RMSProp.

Adam (Deep Learning Optimization Algorithm)

- RMSprop stands for Root Mean Square Propagation.
- RMSprop is a mini-batch learning algorithm similar to AdaGrad; it aims to deal with AdaGrad's radically diminishing learning rates.

- Gradients can be tiny for some weights and huge for others, which makes it difficult to find a single global learning rate for the algorithm.
RMSprop keeps a separate step size for each weight instead of relying on the raw magnitude of the gradient: each gradient is divided by a running root mean square of that weight's recent gradients. The step size adapts individually over time, so that learning accelerates in the directions that need it. Because old gradients are forgotten exponentially, RMSProp can converge rapidly once it reaches a locally convex bowl, as if it were an instance of AdaGrad initialized inside that bowl.
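The per-weight scaling can be illustrated with a minimal sketch. The two gradient values, the constant-gradient assumption, and the hyperparameters below are made up for illustration only:

```python
import numpy as np

# Hypothetical gradients for two weights: one small, one huge.
grads = np.array([1e-2, 1e2])

lr, beta, eps = 0.01, 0.9, 1e-8   # illustrative values
G = np.zeros_like(grads)          # running average of squared gradients

for _ in range(50):               # pretend the gradients stay constant
    G = beta * G + (1 - beta) * grads**2
    rmsprop_step = lr * grads / np.sqrt(G + eps)

plain_step = lr * grads           # step under a single global learning rate

print("plain steps:  ", plain_step)
print("RMSprop steps:", rmsprop_step)
```

With a single global learning rate the two steps differ by four orders of magnitude, while the RMSprop steps both end up close to `lr`.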

RMSprop (Deep Learning Optimization Algorithm)

If we choose the mini-batch size to be 1, we get the algorithm called Stochastic Gradient Descent, or SGD.

In this case, on every iteration, you take a gradient descent step using just a single training example:
$w = w - \alpha \nabla_w J(x^{(i)}, y^{(i)}; w)$

The most important property of SGD is that computation time per step does not grow with the number of examples. This makes SGD very efficient with large training sets.
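As a rough sketch of this, here is SGD fitting a one-parameter linear model. The synthetic data, the squared-error loss, and the values of `alpha` and the epoch count are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3x + noise
N = 1000
x = rng.uniform(-1, 1, size=N)
y = 3.0 * x + 0.1 * rng.normal(size=N)

w = 0.0
alpha = 0.1

for epoch in range(5):
    for i in rng.permutation(N):
        # Gradient of J = 0.5 * (w * x_i - y_i)^2 with respect to w,
        # computed from one example: the cost per step does not depend on N.
        grad = (w * x[i] - y[i]) * x[i]
        w = w - alpha * grad

print(f"estimated w = {w:.3f}")
```

Each update touches a single `(x[i], y[i])` pair, so making `N` larger changes how many steps an epoch takes but not the cost of one step.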

The learning rate is a hyperparameter that must be adjusted. Unlike regular parameters of a model (weights like w and b), which are learned by the algorithm from the training set, hyperparameters are special parameters chosen by the algorithm designer that affect how the algorithm works.
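To see why the learning rate needs adjusting, the sketch below runs plain gradient descent on the toy objective $J(w) = w^2$ with three hand-picked rates. The objective, the starting point, and the rates are illustrative assumptions:

```python
def gradient_descent(alpha, steps=50, w0=5.0):
    """Minimize J(w) = w^2 with a fixed learning rate alpha."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # dJ/dw
        w = w - alpha * grad
    return w

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha = {alpha}: w after 50 steps = {gradient_descent(alpha):.4g}")
```

The smallest rate barely moves away from the starting point, the middle one gets very close to the minimum at 0, and the largest one overshoots on every step and diverges.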

Stochastic Gradient Descent Algorithm

https://www.youtube.com/watch?v=S27pHKBEp30

LSTM is dead. Long Live Transformers!

Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628.

Improving Generalization Performance by Switching from Adam to SGD

$ M^{t} = \beta_{1} M^{t-1} + (1 - \beta_{1}) \nabla J(W^{t-1}) $

$ V^{t} = \beta_{2} V^{t-1} + (1 - \beta_{2}) \left(\nabla J(W^{t-1})\right)^2 $

$ \hat{M}^{t} = \frac{M^{t}}{1 - (\beta_{1})^{t}}, \quad \hat{V}^{t} = \frac{V^{t}}{1 - (\beta_{2})^{t}} $

$ W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{\hat{V}^{t}} + \epsilon} \hat{M}^{t} $

------------------------------------------------------------
$1 - (\beta_{2})^{t}$ and $1 - (\beta_{1})^{t}$ are used to normalize both matrices: M and V start at zero, and the authors of the algorithm noticed that this biases them toward zero during the first iterations. The normalized values are used only in the parameter update; the recursions keep the unnormalized M and V.

$M^{t}$, $\hat{M}^{t}$ - helper matrix similar to the one used for momentum, and its normalized version.

$V^{t}$, $\hat{V}^{t}$ - helper matrix similar to the one used for RMSprop, and its normalized version. The square of the gradient is taken element-wise.

$\beta_{1}, \beta_{2}$ - the terms identical to the ones in momentum and RMSprop (usually $\beta_{1}=0.9, \beta_{2}=0.999$).

$W^{t}$ - the parameters

$\alpha$ - starting learning rate (usually something around 0.001).

$\epsilon$ - it is just to avoid division by zero (usually around 1e-8).
The same principle applies to the bias parameters
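The effect of the $1 - (\beta_{1})^{t}$ normalization can be checked numerically. This is a minimal sketch assuming a constant, made-up gradient `g`; the names `raw` and `corrected` are introduced only for this example:

```python
beta1 = 0.9
g = 2.0          # hypothetical constant gradient
m = 0.0          # first-moment estimate starts at zero

raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g      # unnormalized moving average
    m_hat = m / (1 - beta1 ** t)         # normalized (bias-corrected) value
    raw.append(m)
    corrected.append(m_hat)
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}")
```

The unnormalized average starts far below the true gradient because it is initialized at zero, while the normalized value matches `g` from the first iteration. Note that `m` itself is never overwritten by `m_hat`.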

Adam (Deep Learning Optimization Algorithm) Mathematical Implementation

```python
import numpy as np

class Adam():

    # An object of this class goes together with one layer's parameters
    def __init__(self, input_dims, nodes):

        self.beta1 = 0.9
        self.beta2 = 0.999

        # Here are the M and V matrices for the weights and biases
        self.m_weights = np.zeros((input_dims, nodes))
        self.v_weights = np.zeros((input_dims, nodes))
        self.m_biases = np.zeros(nodes)
        self.v_biases = np.zeros(nodes)

        # need to track the current time step for the normalization
        self.curr_iter = 0

    def get_steps(self, grad_weights, grad_biases, learning_rate):

        eps = 1e-8

        # Follows the formula and returns the updates that have to be
        # subtracted from the weights and biases

        self.m_weights = self.beta1 * self.m_weights + (1 - self.beta1) * grad_weights
        self.v_weights = self.beta2 * self.v_weights + (1 - self.beta2) * np.power(grad_weights, 2)
        self.m_biases = self.beta1 * self.m_biases + (1 - self.beta1) * grad_biases
        self.v_biases = self.beta2 * self.v_biases + (1 - self.beta2) * np.power(grad_biases, 2)

        # Normalize into local copies; the stored M and V stay unnormalized,
        # otherwise the correction would compound from one call to the next
        self.curr_iter += 1
        m_corr = 1 - np.power(self.beta1, self.curr_iter)
        v_corr = 1 - np.power(self.beta2, self.curr_iter)
        m_hat_weights = self.m_weights / m_corr
        v_hat_weights = self.v_weights / v_corr
        m_hat_biases = self.m_biases / m_corr
        v_hat_biases = self.v_biases / v_corr

        weights_step = learning_rate * m_hat_weights / (np.sqrt(v_hat_weights) + eps)
        biases_step = learning_rate * m_hat_biases / (np.sqrt(v_hat_biases) + eps)

        return weights_step, biases_step
```

Adam (Deep Learning Optimization Algorithm) Python Implementation

Adam vs. SGD vs. RMSProp vs. SWA vs. AdaTune

$ G^{t} = \beta G^{t-1} + (1 - \beta) \left(\nabla J(W^{t-1})\right)^2 $

$ W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{G^{t} + \epsilon}} \nabla J(W^{t-1}) $

-----------------------------------------
The same principle applies to the bias parameters

$G^{t}$ - helper matrix that holds a running average of the element-wise squared gradients

$\beta$ - the decay rate that controls how quickly old squared gradients fade out of G (usually around 0.9)

$W^{t}$ - the parameters

$\alpha$ - starting learning rate (usually something around 0.1 or 0.01)

$\epsilon$ - it is just to avoid division by zero (usually around 1e-8)

RMSprop (Deep Learning Optimization Algorithm) Mathematical Implementations

```python
import numpy as np

class RMSprop():

    # An object of this class goes together with one layer's parameters
    def __init__(self, input_dims, nodes):

        self.learning_rate = 0.01
        self.beta = 0.9

        # G matrix for weights
        self.G_weights = np.zeros((input_dims, nodes))

        # G matrix for biases
        self.G_biases = np.zeros(nodes)

    # Takes the gradients of the weights and biases and
    # returns the updates that we need to subtract from
    # the current weights and biases
    def get_steps(self, grad_weights, grad_biases):

        eps = 1e-8

        # updating the G matrices (running averages of squared gradients)
        self.G_weights = self.beta * self.G_weights + (1 - self.beta) * np.power(grad_weights, 2)
        self.G_biases = self.beta * self.G_biases + (1 - self.beta) * np.power(grad_biases, 2)

        # scale each gradient by the root of its own running average
        weights_step = self.learning_rate * grad_weights / np.sqrt(self.G_weights + eps)
        biases_step = self.learning_rate * grad_biases / np.sqrt(self.G_biases + eps)

        return weights_step, biases_step
```

RMSprop (Deep Learning Optimization Algorithm) Python implementation

On iteration t:
         Compute dW, db on the current mini-batch
                $S_{dW} = \beta S_{dW} + (1-\beta) dW^2$
                $S_{db} = \beta S_{db} + (1-\beta) db^2$
                $W := W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon}, \quad b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$
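The iteration above can be turned into a small runnable sketch. The toy regression data, the squared-error loss, the batch size, and the hyperparameter values are all assumptions made for illustration; the `eps` term only guards against division by zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 2x - 1 plus a little noise
X = rng.uniform(-1, 1, size=200)
Y = 2.0 * X - 1.0 + 0.05 * rng.normal(size=200)

W, b = 0.0, 0.0
S_dW, S_db = 0.0, 0.0
alpha, beta, eps = 0.01, 0.9, 1e-8
batch_size = 20

for t in range(2000):
    # Compute dW, db on the current mini-batch
    idx = rng.integers(0, len(X), size=batch_size)
    err = W * X[idx] + b - Y[idx]
    dW = np.mean(err * X[idx])   # gradient of 0.5 * mean squared error w.r.t. W
    db = np.mean(err)            # gradient w.r.t. b

    # Running averages of the squared gradients
    S_dW = beta * S_dW + (1 - beta) * dW**2
    S_db = beta * S_db + (1 - beta) * db**2

    # Scaled parameter updates
    W -= alpha * dW / (np.sqrt(S_dW) + eps)
    b -= alpha * db / (np.sqrt(S_db) + eps)

print(f"W = {W:.2f}, b = {b:.2f}")
```

Because every step is divided by the root of the running average, its size stays on the order of `alpha` no matter how large or small the raw gradient is.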


RMSprop (Deep Learning Optimization Algorithm) Pseudocode

Batch gradient descent (batch size = N) takes relatively low-noise, relatively large steps, so you can just keep marching toward the minimum. However, every step processes the whole training set, so it may take a long time and need additional memory.

Stochastic gradient descent (batch size = 1) is easy to fit in memory and efficient for large datasets. But it can be extremely noisy: sometimes you head in the wrong direction because a single training example happens to point in a bad direction. With a fixed learning rate it won't ever converge, and will always just kind of oscillate and wander around the region of the minimum.

In practice, mini-batch gradient descent with a batch size between 1 and N works better. It's not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum.
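The noise trade-off between batch sizes can be measured directly. The sketch below is an illustration under assumptions: synthetic linear data, a squared-error loss, and gradients evaluated at one fixed point `w = 0`; the helper `grad` and the dictionary `noise` are names made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3x + noise
N = 10_000
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(size=N)
w = 0.0  # evaluate every gradient estimate at the same point

def grad(idx):
    """Gradient of 0.5 * mean squared error over the examples in idx."""
    return np.mean((w * x[idx] - y[idx]) * x[idx])

full_grad = grad(np.arange(N))   # batch size = N, no sampling noise

noise = {}
for batch_size in (1, 32, 1024):
    estimates = [
        grad(rng.choice(N, size=batch_size, replace=False))
        for _ in range(500)
    ]
    noise[batch_size] = np.std(estimates)
    print(f"batch size {batch_size:>4}: std of gradient estimate = {noise[batch_size]:.3f}")

print(f"full-batch gradient = {full_grad:.3f}")
```

The spread of the gradient estimate shrinks as the batch grows, which is why mini-batches head toward the minimum more consistently than single examples while still being much cheaper per step than the full batch.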
