```python
import numpy as np


class RMSprop:

    # One instance of this class is paired with one layer's parameters
    def __init__(self, input_dims, nodes):

        self.learning_rate = 0.01
        self.beta = 0.9

        # G matrix (running average of squared gradients) for the weights
        self.G_weights = np.zeros((input_dims, nodes))

        # G vector for the biases
        self.G_biases = np.zeros(nodes)

    # Takes the gradients of the weights and biases.
    # Returns the steps that need to be subtracted from
    # the current weights and biases.
    def get_steps(self, grad_weights, grad_biases):

        eps = 1e-8

        # Update the G matrices
        self.G_weights = self.beta * self.G_weights + (1 - self.beta) * np.power(grad_weights, 2)
        self.G_biases = self.beta * self.G_biases + (1 - self.beta) * np.power(grad_biases, 2)

        # Scale each gradient entry by learning_rate / sqrt(G + eps)
        weights_step = np.multiply((self.learning_rate / np.sqrt(self.G_weights + eps)), grad_weights)
        biases_step = np.multiply((self.learning_rate / np.sqrt(self.G_biases + eps)), grad_biases)

        return weights_step, biases_step
```

RMSprop (Deep Learning Optimization Algorithm) Python implementation

- Adam is fast, but tends to overfit
- SGD is slow but gives great results
- RMSProp sometimes works best
- SWA can easily improve quality
- AdaTune magically improves the learning rate

Adam vs. SGD vs. RMSProp vs. SWA vs. AdaTune

On iteration t:
         Compute dW, db on the current mini-batch
                $S_{dW} = \beta S_{dW} + (1-\beta) dW^2$
                $S_{db} = \beta S_{db} + (1-\beta) db^2$
                $W := W - \alpha \frac{dW}{\sqrt{S_{dW}}}, b := b - \alpha \frac{db}{\sqrt{S_{db}}}$
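
The iteration above can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function name `rmsprop_step` and the default values are illustrative assumptions, and a small `eps` is added to the denominator (it does not appear in the pseudocode) to avoid division by zero.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, S_dW, S_db, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; returns the new parameters and running averages."""
    # Leaky averages of the element-wise squared gradients
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # eps is an added assumption for numerical stability
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)
    b = b - alpha * db / (np.sqrt(S_db) + eps)
    return W, b, S_dW, S_db

# Made-up parameters and gradients for a single step
W, b = np.array([1.0, -2.0]), np.array([0.5])
S_dW, S_db = np.zeros_like(W), np.zeros_like(b)
W, b, S_dW, S_db = rmsprop_step(W, b, np.array([0.3, -0.1]), np.array([0.2]), S_dW, S_db)
```

Starting from zero running averages, the first step has the same magnitude in every coordinate, because each gradient is divided by a multiple of its own absolute value.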


RMSprop (Deep Learning Optimization Algorithm) Pseudocode

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum, and RMSProp. It brings the benefits from both sides - adaptive learning rate and faster convergence with momentum.

Adam (Deep Learning Optimization Algorithm)

To visualize RMSProp's convergence behavior, the algorithm is applied to the two-dimensional quadratic function $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$ with a learning rate of $$\eta = 0.4$$ and decay parameter $$\gamma = 0.9$$. The coordinate-wise implementation computes gradients $$g_1 = 0.2 x_1$$ and $$g_2 = 4 x_2$$, updates the leaky averages of squared gradients as $$s_i = \gamma s_i + (1 - \gamma) g_i^2$$, and adjusts each coordinate by $$x_i \leftarrow x_i - \frac{\eta}{\sqrt{s_i + \epsilon}} g_i$$. After $$20$$ epochs, the variables converge near the origin ($$x_1 \approx -0.0106$$, $$x_2 \approx 0$$). Unlike AdaGrad, which stalls in later iterations because the learning rate decreases too quickly, RMSProp maintains effective progress throughout training because $$\eta$$ is controlled independently from the state variable rescaling.

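```python
import math

eta, gamma = 0.4, 0.9

def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f(x) = 0.1 * x1^2 + 2 * x2^2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    # Leaky averages of the squared gradients
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    # Per-coordinate rescaled update
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2
```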
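
The contrast with AdaGrad can be checked numerically. The sketch below runs both coordinate-wise updates for 20 steps on the same quadratic. The starting point (-5, -2), the helper names `adagrad_2d` and `run`, and the reuse of the same learning rate for AdaGrad are assumptions made for illustration.

```python
import math

eta, gamma, eps = 0.4, 0.9, 1e-6

def rmsprop_2d(x1, x2, s1, s2):
    g1, g2 = 0.2 * x1, 4 * x2
    # Leaky average: old squared gradients are gradually forgotten
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def adagrad_2d(x1, x2, s1, s2):
    g1, g2 = 0.2 * x1, 4 * x2
    # Cumulative sum: the state only grows, so the steps keep shrinking
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def run(step, x1=-5.0, x2=-2.0, epochs=20):
    s1 = s2 = 0.0
    for _ in range(epochs):
        x1, x2, s1, s2 = step(x1, x2, s1, s2)
    return x1, x2

rms_x1, rms_x2 = run(rmsprop_2d)
ada_x1, ada_x2 = run(adagrad_2d)
print(f"RMSProp: x1={rms_x1:.4f}, x2={rms_x2:.4f}")
print(f"AdaGrad: x1={ada_x1:.4f}, x2={ada_x2:.4f}")
```

Under these assumptions RMSProp ends much closer to the origin along $$x_1$$, the flat direction, than AdaGrad does after the same number of steps.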


RMSProp Optimization Trajectory in 2D

A from-scratch implementation of the RMSProp optimizer for deep networks requires maintaining an auxiliary state variable for each parameter tensor, initialized to zeros with the same shape. During each update step, the state is updated as a leaky average of squared gradients: $$\mathbf{s} \leftarrow \gamma \mathbf{s} + (1 - \gamma) \mathbf{g}^2$$, where $$\gamma$$ is the decay factor and $$\mathbf{g}$$ is the current gradient. The parameter is then decremented by the learning rate times the gradient divided by the square root of the state plus a numerical stability constant ($$\epsilon = 10^{-6}$$). Finally, the parameter gradients are zeroed out.

In PyTorch, this can be implemented as follows:

```python
import torch

def init_rmsprop_states(feature_dim):
    # One zero-initialized state tensor per parameter tensor
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Leaky average of squared gradients, updated in place
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        # Reset the gradient for the next backward pass
        p.grad.data.zero_()
```


RMSProp Optimizer From-Scratch Implementation

When the weighting term $$\gamma$$ is set to $$0.9$$, the state variable $$\mathbf{s}$$ in RMSProp effectively aggregates information over the past $$\frac{1}{1 - \gamma} = \frac{1}{1 - 0.9} = 10$$ observations of the squared gradient. This applies the general exponentially weighted average principle—where a decay factor $$\beta$$ yields an effective window of $$\frac{1}{1 - \beta}$$—specifically to RMSProp's squared gradient state variable. The quantity $$\frac{1}{1 - \gamma}$$ defines the effective observation window: a larger $$\gamma$$ produces a longer memory (smoother average), while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients.
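
The effective window can be made concrete by listing the weights that the leaky average assigns to past squared gradients. The snippet below is a small numeric illustration; the variable names and the cutoff of 200 terms are arbitrary choices.

```python
gamma = 0.9
window = round(1 / (1 - gamma))  # effective window, 10 for gamma = 0.9

# Weight that the leaky average assigns to the squared gradient k steps back
weights = [(1 - gamma) * gamma ** k for k in range(200)]

recent_mass = sum(weights[:window])   # share carried by the most recent 10 observations
total_mass = sum(weights)             # approaches 1 as the history grows
print(window, round(recent_mass, 4), round(total_mass, 4))
```

For $$\gamma = 0.9$$ the ten most recent observations carry $$1 - 0.9^{10}$$, roughly 65%, of the total weight, which is why the window is called effective rather than exact.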

Effective Observation Window of RMSProp

The RMSProp update rule maintains a state variable $$G^{t}$$ that tracks the exponentially weighted average of squared gradients, and uses it to adaptively scale the learning rate:

$$G^{t} = \beta G^{t-1} + (1 - \beta) (\nabla J(W^{t}))^2$$

$$W^{t} = W^{t-1} - \frac{\alpha}{\sqrt{G^{t} + \epsilon}} \nabla J(W^{t})$$

The same principle applies to the bias parameters.

- $$G^{t}$$: the running (exponentially weighted) average of squared gradients, kept as a helper matrix with the same shape as $$W$$
- $$\beta$$: the decay factor controlling how quickly the running average forgets old observations (typically around $$0.9$$)
- $$W^{t}$$: the model parameters
- $$\alpha$$: the initial learning rate (typically around $$0.01$$)
- $$\epsilon$$: a small constant to prevent division by zero (typically around $$10^{-6}$$ or $$10^{-8}$$)

RMSprop (Deep Learning Optimization Algorithm) Mathematical Implementations

Adadelta is an optimization algorithm that has no explicit learning rate parameter. Instead, it uses the rate of change in the parameters themselves to dynamically adapt the learning rate. To accomplish this, the algorithm utilizes two specific state variables: $$\mathbf{s}_t$$ to track a leaky average of the second moment of the gradient, and $$\Delta\mathbf{x}_t$$ to track a leaky average of the second moment of the model's parameter changes. The algorithm retains standard naming conventions for these variables to maintain consistency with similar optimization methods like momentum, AdaGrad, and RMSProp.
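
A minimal NumPy sketch of one Adadelta step, under stated assumptions, is given below. The function name `adadelta_step`, the decay `rho=0.9`, the constant `eps=1e-5`, and the toy objective are illustrative choices, not values taken from the text. Here `s` plays the role of $$\mathbf{s}_t$$ and `delta` the role of $$\Delta\mathbf{x}_t$$.

```python
import numpy as np

def adadelta_step(x, grad, s, delta, rho=0.9, eps=1e-5):
    # Leaky average of the squared gradient
    s = rho * s + (1 - rho) * grad ** 2
    # Rescale the gradient by the ratio of the two running RMS values;
    # no explicit learning rate appears anywhere
    g_rescaled = np.sqrt(delta + eps) / np.sqrt(s + eps) * grad
    x = x - g_rescaled
    # Leaky average of the squared parameter change
    delta = rho * delta + (1 - rho) * g_rescaled ** 2
    return x, s, delta

x = np.array([1.0, -1.0])
s, delta = np.zeros_like(x), np.zeros_like(x)
for _ in range(3):
    grad = 2 * x   # gradient of the toy objective sum(x**2), assumed for the demo
    x, s, delta = adadelta_step(x, grad, s, delta)
```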

Adadelta

- Stands for Root Mean Square Propagation
- RMSProp is an optimization algorithm closely related to AdaGrad, as both employ the square of the gradient to scale the update coefficients on a per-coordinate basis. However, RMSProp overcomes AdaGrad's tendency for radically diminishing learning rates by using a leaky (exponentially weighted) average of squared gradients rather than a cumulative sum.
- RMSProp also shares the leaky averaging mechanism with the momentum method, but applies it differently: whereas momentum uses leaky averaging to smooth the gradient direction, RMSProp uses the technique to adjust the coefficient-wise preconditioner that rescales the learning rate independently for each parameter.
- Because RMSProp does not automatically schedule the learning rate (unlike AdaGrad, whose learning rate decays implicitly through accumulation), the learning rate must be explicitly scheduled by the practitioner in practice.
- The decay coefficient $$\gamma$$ governs how long the gradient history is retained when adjusting the per-coordinate scale: a larger $$\gamma$$ produces a longer memory, while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients.


When a neural network trains, it uses an algorithm to determine the optimal weights for the model, called an optimizer. 
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

Exponentially weighted averaging is a technique frequently used for time-series data. By taking a weighted average of previous data points, with weights that decay exponentially with age, you can smooth the series and get an approximate trend.

Consider a series of data points $\theta_0,...,\theta_n$:
$$\left\{ \begin{array}{ll}v_t = \theta_t & t=0 \\
v_t = \beta v_{t-1} +(1-\beta)\theta_t & \text{otherwise} \end{array}\right.$$
Expanding the second formula gives
$v_t  = \beta v_{t-1}+(1-\beta)\theta_t$ 
      $= (1-\beta)\theta_t+\beta(\beta v_{t-2}+(1-\beta)\theta_{t-1})$
      $= (1-\beta)\theta_t + (1-\beta)\beta\theta_{t-1}+ (1-\beta)\beta^2\theta_{t-2}+...$
To get a sense of how quickly the weights decay as $\beta$ gets closer to 1, set $\epsilon = 1-\beta$ and use
$$(1 - \epsilon)^{1 / \epsilon}\approx \frac{1}{e} \Rightarrow \beta^{1/(1-\beta)}\approx \frac{1}{e}$$
If $w_i$ denotes the weight assigned to $\theta_i$, then
                                         $w_{t-1/(1-\beta)}\approx\frac{1}{e}w_t$
Therefore, $v_t$ is approximately an average over the last $1/(1-\beta)$ data points (days, for daily data).
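
The recursion and its expansion can be checked against each other numerically. The sketch below uses a made-up series and a hypothetical helper `ewa`; it also evaluates $\beta^{1/(1-\beta)}$, the factor by which the weight shrinks across one window of $1/(1-\beta)$ points.

```python
import math

def ewa(series, beta):
    """Return v_t for every t, with v_0 = theta_0 as in the definition above."""
    v = [series[0]]
    for theta in series[1:]:
        v.append(beta * v[-1] + (1 - beta) * theta)
    return v

beta = 0.9
series = [float(i % 7) for i in range(60)]   # made-up data
v = ewa(series, beta)

# Explicit expansion of the last value: weight (1 - beta) * beta**k on theta_{t-k},
# plus beta**t on the initial value theta_0
t = len(series) - 1
expanded = sum((1 - beta) * beta ** k * series[t - k] for k in range(t)) + beta ** t * series[0]

decay_over_window = beta ** (1 / (1 - beta))   # weight ratio across 1/(1-beta) steps
print(abs(v[-1] - expanded) < 1e-9, round(decay_over_window, 3), round(1 / math.e, 3))
```

For $\beta = 0.9$ the factor is about 0.35, close to $1/e \approx 0.37$, and the approximation tightens as $\beta$ approaches 1.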

Exponentially Weighted Average

Here is a very helpful article on different types of optimizer algorithms
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Dive into Deep Learning

Batch Gradient Descent requires the entire dataset to be processed to complete one step. Gradient Descent can require many steps in some cases which causes this procedure to be very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by taking small groups of random data points called mini-batches and using them to estimate each step.  The mini-batches are usually a size greater than 1 but less than N (dataset size). The smaller steps allow for a much faster algorithm.
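
As a rough illustration, the loop below fits a one-variable linear model with mini-batch gradient descent. The synthetic data, the batch size of 32, the learning rate, and the epoch count are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 plus a little noise
N = 1000
X = rng.normal(size=(N, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=N)

w, b = 0.0, 0.0
lr, batch_size, epochs = 0.1, 32, 20

for _ in range(epochs):
    order = rng.permutation(N)               # reshuffle once per epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = w * xb + b - yb
        # Gradients of the mean squared error, estimated on the mini-batch only
        w -= lr * 2 * np.mean(err * xb)
        b -= lr * 2 * np.mean(err)

print(round(w, 2), round(b, 2))
```

Each parameter update touches only 32 points instead of all 1000, so one pass over the data yields about 32 updates rather than a single one.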

Mini-Batch Gradient Descent

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients, and then use that average, rather than the raw gradient, to update your weights. It almost always works faster than the standard gradient descent algorithm.
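
A minimal sketch of this update, assuming the exponentially weighted form with a $(1-\beta)$ factor on the new gradient, is shown below. The function name `momentum_step`, the hyperparameter values, and the toy objective are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    # Exponentially weighted average of past gradients
    v = beta * v + (1 - beta) * grad
    # Step along the averaged gradient instead of the raw one
    w = w - lr * v
    return w, v

def objective(w):
    # Toy objective f(w) = 0.1 * w[0]**2 + 2 * w[1]**2
    return 0.1 * w[0] ** 2 + 2 * w[1] ** 2

w = np.array([-5.0, -2.0])
v = np.zeros_like(w)
start_value = objective(w)
for _ in range(100):
    grad = np.array([0.2 * w[0], 4.0 * w[1]])
    w, v = momentum_step(w, grad, v)
end_value = objective(w)
```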


Gradient Descent with Momentum

Learning rate decay is the gradual reduction of the learning rate as a function of time to speed up the learning algorithm. Decaying the learning rate as the gradient descent approaches completion reduces noise and facilitates a tighter convergence to a target.
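
One common schedule is inverse-time decay, sketched below. The function name `decayed_lr` and the specific values are assumptions for illustration; other schedules (exponential, step-wise) follow the same pattern.

```python
def decayed_lr(initial_lr, decay_rate, epoch):
    # Inverse-time decay: the rate shrinks as the epoch count grows
    return initial_lr / (1 + decay_rate * epoch)

schedule = [decayed_lr(0.2, 1.0, epoch) for epoch in range(4)]
print(schedule)
```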

Learning Rate Decay

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.
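
The idea fits in a few lines. The sketch below minimizes a one-dimensional toy loss; the function name `gradient_descent`, the learning rate, and the loss itself are assumptions chosen for illustration.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    for _ in range(steps):
        # The gradient points uphill, so step in the opposite direction
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy loss L(theta) = (theta - 3)**2 with gradient 2 * (theta - 3)
theta_min = gradient_descent(lambda th: 2 * (th - 3), theta=0.0)
print(theta_min)
```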

Gradient Descent

RMSprop (Deep Learning Optimization Algorithm)

In the momentum method we basically first moved our weight in the direction of the current gradient and then moved in the direction  of momentum (weighted sum of all previous steps). Now in the new method we first move in the direction of the momentum and then calculate the gradient at the new point. Using this gradient we move in the direction of the new gradient. 

The update rules are as follows:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
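
The two update rules translate directly into code. In the sketch below, `alpha` is the momentum coefficient and `eps` is the learning rate, following the notation of the formulas above; the values 0.9 and 0.05, the function name `nesterov_step`, and the toy objective are assumptions.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, alpha=0.9, eps=0.05):
    # Evaluate the gradient at the look-ahead point theta + alpha * v
    g = grad_fn(theta + alpha * v)
    v = alpha * v - eps * g
    theta = theta + v
    return theta, v

def grad_fn(th):
    # Gradient of the toy objective 0.1 * th[0]**2 + 2 * th[1]**2
    return np.array([0.2 * th[0], 4.0 * th[1]])

theta, v = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn)
```

The only difference from classical momentum is the argument passed to `grad_fn`: the gradient is taken after the momentum move, not before it.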

Nesterov momentum (Deep Learning Optimization Algorithm)

- **Local optima**: it's actually unlikely to get stuck in local optima.
- **Cliffs**: on the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact Gradients**: sometimes approximation is needed for gradients
- **Plateaus**: low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean square propagation).
Like momentum, it smooths the gradient with a leaky average; like RMSprop, it rescales each coordinate by a leaky average of the squared gradient.

Adam introduces four hyperparameters:
- learning rate alpha
- beta from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

As mentioned above, you usually do not need to tune beta, beta2, and epsilon as the values listed above will generally work well. Only the learning rate is left to tune in order to accelerate training.


Adam combines the advantages of two optimization algorithms, AdaGrad and RMSProp. It takes into account both the first moment estimate of the gradient (the mean of the gradient) and the second moment estimate (the uncentered variance of the gradient) when calculating the update step size.
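
A single Adam step can be sketched as follows. This is a minimal sketch, not a reference implementation: the function name `adam_step` and the default learning rate of 0.001 are assumptions, while `beta1`, `beta2` and `eps` use the typical values listed above. The division by `1 - beta ** t` is the usual bias correction for zero-initialized averages.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: momentum-style average of the gradient
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: RMSprop-style average of the squared gradient
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
grad = np.array([0.5, -0.1])   # made-up gradient
w, m, v = adam_step(w, grad, m, v, t=1)
```

On the very first step the corrected moments reduce to the gradient and its square, so every coordinate moves by almost exactly `alpha` in the direction opposite to its gradient sign.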

Adam optimization algorithm


Adam differs from classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates, and that learning rate does not change during training unless it is scheduled explicitly. Adam combines the advantages of AdaGrad and RMSProp. It adapts the per-parameter learning rates using the average of the second moments of the gradients (the uncentered variance), as in RMSProp, and additionally makes use of the average first moment (the mean).

Difference between Adam and SGD

The Adagrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing simple feature occurrence counters with an aggregate of the squares of previously observed gradients. Specifically, it uses $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ to adjust the learning rate. This automatically scales down the step size significantly for coordinates that frequently have large gradients, while applying a gentler treatment to coordinates with small gradients, thereby eliminating the need to manually decide when a gradient is considered large enough.
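
The per-coordinate scaling can be seen in a short NumPy sketch. The function name `adagrad_step`, the learning rate, `eps`, and the constant made-up gradients are assumptions for illustration.

```python
import numpy as np

def adagrad_step(x, grad, s, lr=0.1, eps=1e-6):
    # Running sum (not average) of squared gradients, per coordinate
    s = s + grad ** 2
    x = x - lr * grad / np.sqrt(s + eps)
    return x, s

x, s = np.zeros(2), np.zeros(2)
steps = []
for _ in range(5):
    grad = np.array([10.0, 0.1])   # one consistently large, one consistently small gradient
    x_new, s = adagrad_step(x, grad, s)
    steps.append(np.abs(x_new - x))
    x = x_new

effective_lr = 0.1 / np.sqrt(s + 1e-6)   # per-coordinate learning rate after 5 steps
print(effective_lr)
```

The coordinate with large gradients ends up with a far smaller effective learning rate than the one with small gradients, and both keep shrinking as the sum accumulates.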

Adagrad

Assume we have a data series of temperatures (blue dots). We can use the exponentially weighted average formula shown in the parent node to get an approximate trend.
When $\beta=0.9$,
                   $\frac{1}{1-\beta} = \frac{1}{1-0.9} = 10$
So we are averaging over about 10 days.
When $\beta=0.98$,
                   $\frac{1}{1-\beta} = \frac{1}{1-0.98} = 50$
So we are averaging over about 50 days. Thus we will get a smoother curve as shown below.

An Example of Exponentially Weighted Average

In an exponentially weighted average, when the time step $$t$$ is small, the estimation only considers a few data points. This can cause an initial bias towards smaller values, especially if the initial value is set to $$v_0 = 0$$. Bias correction adjusts these early estimates to provide a more accurate trend. The bias-corrected value $$v_t'$$ is calculated as:

$$v_t' = \frac{v_t}{1 - \beta^t}$$

where $$v_t$$ is the uncorrected exponentially weighted average and $$\beta$$ is the weighting parameter. For example, as shown in the accompanying diagram, without correction, the early temperature estimates (the purple curve) indicate artificially low values due to the influence of the zero initialization, whereas the bias-corrected estimates (the green curve) better approximate the true underlying trend.
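
The effect is easy to reproduce on a constant series, where the true trend is known. The helper name `ewa_with_bias_correction` and the sample values are assumptions for illustration.

```python
def ewa_with_bias_correction(series, beta):
    v, raw, corrected = 0.0, [], []
    for t, theta in enumerate(series, start=1):
        v = beta * v + (1 - beta) * theta       # uncorrected average, v_0 = 0
        raw.append(v)
        corrected.append(v / (1 - beta ** t))   # v_t' = v_t / (1 - beta^t)
    return raw, corrected

raw, corrected = ewa_with_bias_correction([20.0] * 5, beta=0.9)
print(raw[0], corrected[0])
```

With every data point equal to 20, the first uncorrected estimate is only 2, while the corrected estimates equal 20 (up to floating-point rounding) from the first step onward.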
