As shown in the picture below, plain gradient descent oscillates up and down in the vertical direction while still moving right in the horizontal direction. By averaging the previous few gradients, you damp the oscillations in the vertical direction, because the positive and negative values cancel out. And since all gradients point the same way horizontally, the averaged step in the horizontal direction remains large and in the right direction.

Intuition behind Gradient Descent with Momentum

These plots were generated with gradient descent, gradient descent with momentum (β = 0.5), and gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

On iteration t:
         Compute dW, db on the current mini-batch
                $v_{dW} = \beta v_{dW} + (1-\beta)dW$
                $v_{db} = \beta v_{db} + (1-\beta)db$
                $W = W - \alpha v_{dW}, b = b - \alpha v_{db}$
Note that we now have two hyperparameters: the learning rate $\alpha$ and the momentum coefficient $\beta$.
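
The update rule above can be sketched in a few lines of Python. This is a minimal illustration for a single scalar parameter, not a full training loop; the function name `momentum_step` and the toy objective $J(w) = w^2$ are assumptions made for the example.

```python
def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    """One gradient-descent-with-momentum update for a scalar parameter."""
    v = beta * v + (1 - beta) * grad  # exponentially weighted average of gradients
    w = w - alpha * v                 # step along the averaged gradient
    return w, v


# Hypothetical toy problem: minimize J(w) = w**2, whose gradient is 2 * w.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=2 * w)

print(w)  # close to the minimizer w = 0
```

In a real network `w`, `v`, and `grad` would be arrays of the same shape, and the gradient would come from the current mini-batch.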

Gradient Descent with Momentum Pseudocode 

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum, and RMSProp. It brings the benefits from both sides - adaptive learning rate and faster convergence with momentum.

Adam (Deep Learning Optimization Algorithm)

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients, and then use that average, instead of the raw gradient, to update your weights. It almost always works faster than the standard gradient descent algorithm.


University of Michigan - Ann Arbor

Exponentially weighted average is a technique frequently used for time-series data. By taking a weighted average of previous data points, with weights that decay exponentially into the past, you can smooth your data series and get an approximate trend of it.

Suppose you have a series of data points $\theta_0,...,\theta_n$. Define
$$\left\{ \begin{array}{ll}v_t = \theta_t & t=0 \\
v_t = \beta v_{t-1} +(1-\beta)\theta_t & \text{otherwise} \end{array}\right.$$
If we expand the second formula,
$v_t  = \beta v_{t-1}+(1-\beta)\theta_t$ 
      $= (1-\beta)\theta_t+\beta(\beta v_{t-2}+(1-\beta)\theta_{t-1})$
      $= (1-\beta)\theta_t + (1-\beta)\beta\theta_{t-1}+ (1-\beta)\beta^2\theta_{t-2}+...$
To get a sense of how quickly the weights decay as $\beta$ gets closer to 1, let $\epsilon = 1-\beta$ and use
$$(1 - \epsilon)^{1 / \epsilon}\approx \frac{1}{e} \Rightarrow \beta^{1/(1-\beta)}\approx \frac{1}{e}$$
If we let $w_i$ denote the weight we assign to $\theta_i$, then
                                         $w_{t-1/(1-\beta)}\approx\frac{1}{e}w_t$
That is, a point $1/(1-\beta)$ steps in the past carries only about $1/e \approx 0.37$ of the weight of the newest point. Therefore, we are approximately averaging over the last $1/(1-\beta)$ days when calculating $v_t$.
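
The recursion above can be sketched as follows. The function name `ewa` and the three-point series are made up for illustration; the initialization $v_0 = \theta_0$ follows the definition in this card.

```python
def ewa(series, beta):
    """Exponentially weighted average with v_0 = theta_0."""
    v = series[0]
    averaged = [v]
    for theta in series[1:]:
        v = beta * v + (1 - beta) * theta  # blend old average with new point
        averaged.append(v)
    return averaged


smoothed = ewa([10.0, 20.0, 30.0], beta=0.5)
print(smoothed)  # [10.0, 15.0, 22.5]
```

With `beta=0.5` each new point gets half the weight, so the average trails the rising series; a larger `beta` would trail it even more.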

Exponentially Weighted Average

When a neural network trains, it uses an algorithm called an optimizer to determine the optimal weights for the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

Assuming that the error function is $J(w)$ with one parameter $w$, to minimize the error, we can update the weight $w$ as follows.
$$w = w - \alpha * \frac{dJ(w)}{dw}$$
, where $\alpha$ is a learning rate, and $\frac{dJ(w)}{dw}$ is the derivative of $J(w)$ with respect to $w$.

If the error function has two or more parameters, for example, a weight $w$ and a bias $b$, we can update them one by one.
$$w = w - \alpha * \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha * \frac{\partial J(w,b)}{\partial b}$$
, where $\partial$ is a stylish cursive $d$, denoting the partial derivatives.
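
As a small illustration of the two update rules, the sketch below runs gradient descent on a made-up error function $J(w,b) = (w-3)^2 + (b+1)^2$. The function, its gradient helper `grad_J`, and the chosen learning rate are assumptions for the example only.

```python
def grad_J(w, b):
    """Partial derivatives of the hypothetical J(w, b) = (w - 3)**2 + (b + 1)**2."""
    return 2 * (w - 3), 2 * (b + 1)


alpha = 0.1  # learning rate
w, b = 0.0, 0.0
for _ in range(100):
    dw, db = grad_J(w, b)
    # Update both parameters from gradients taken at the same point.
    w, b = w - alpha * dw, b - alpha * db

print(w, b)  # approaches the minimizer (3, -1)
```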

(Batch) Gradient Descent (Deep Learning Optimization Algorithm)

https://www.coursera.org/learn/deep-neural-network?specialization=deep-learning

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

A helpful website for understanding gradient descent:
https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Reference

Assume we have a data series of temperature (blue dots), we could use the formula shown in the parent node to get an approximate trend. 
When $\beta=0.9$,
                   $\frac{1}{1-\beta} = \frac{1}{1-0.9} = 10$
So we are averaging over about 10 days.
When $\beta=0.98$,
                   $\frac{1}{1-\beta} = \frac{1}{1-0.98} = 50$
So we are averaging over about 50 days. Thus we will get a smoother curve as shown below.
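
The two window sizes above can be checked numerically. The helper name `effective_window` is hypothetical; the script also prints how much weight a point at the edge of the window carries relative to the newest point, which should be close to $1/e$.

```python
import math


def effective_window(beta):
    """Approximate number of points being averaged: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


for beta in (0.9, 0.98):
    n = round(effective_window(beta))
    # Relative weight of the point n steps back compared with the newest point.
    print(beta, n, beta ** n, 1 / math.e)
```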

An Example of Exponentially Weighted Average

Gradient Descent with Momentum

In the previous definition of exponentially weighted average, when $t$ is small, $v_t$ is built from only a few data points, and if the average is instead initialized at $v_0 = 0$, the early estimates are biased toward zero. Bias correction compensates for this by rescaling:
$$v_t'=\frac{v_t}{1-\beta^t}$$

For example, in the following diagram the early estimates are biased because they are pulled toward $v_0 = 0$. We are looking for the green curve, but without correction we obtain the purple one, which gives biased estimates for the early days.
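
A minimal sketch of the correction, assuming the average starts at $v_0 = 0$ and the first data point is counted as $t = 1$. The function name and the constant temperature series are made up; a constant series makes the effect easy to see, because the corrected estimate should equal the constant from the first step.

```python
def ewa_bias_corrected(series, beta):
    """Exponentially weighted average started at v = 0, with bias correction."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(series, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))  # rescale by 1 / (1 - beta^t)
    return corrected


corrected = ewa_bias_corrected([40.0, 40.0, 40.0], beta=0.9)
print(corrected)  # each entry is approximately 40, even at t = 1
```

Without the division, the first estimate would be $0.1 \times 40 = 4$, far below the true level.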

Bias Correction

- Stands for Root Mean Square Propagation
- RMSprop is a mini-batch learning algorithm similar to AdaGrad that aims to deal with AdaGrad's radically diminishing learning rates.

- Some gradients may be tiny and others huge, which makes it difficult to find a single global learning rate that works for every weight.
RMSprop therefore adapts the step size for each weight individually: it divides the learning rate by a running average of the magnitudes of recent gradients for that weight, so the update depends less on the raw magnitude of the gradient. The step size adapts over time, so that we accelerate learning in the direction that we need. In this way, RMSProp mimics initializing an instance of AdaGrad in a locally convex bowl, allowing it to converge rapidly there.
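
The card does not give a formula, so the sketch below uses the commonly stated form of RMSprop as an assumption: keep a running average `s` of squared gradients and divide the step by its square root. The names `rmsprop_step`, `s`, and the default values are illustrative.

```python
import math


def rmsprop_step(w, s, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update for a scalar weight (commonly used form, assumed here)."""
    s = beta * s + (1 - beta) * grad ** 2       # running average of squared gradients
    w = w - alpha * grad / (math.sqrt(s) + eps)  # step rescaled per weight
    return w, s


# A tiny and a huge gradient produce first steps of nearly the same size.
w_small, _ = rmsprop_step(0.0, 0.0, grad=0.01)
w_large, _ = rmsprop_step(0.0, 0.0, grad=100.0)
print(-w_small, -w_large)
```

This is the sense in which one global learning rate can serve weights whose gradients differ by orders of magnitude.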

RMSprop (Deep Learning Optimization Algorithm)

Batch Gradient Descent requires the entire dataset to be processed to complete one step. Gradient Descent can require many steps in some cases which causes this procedure to be very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by drawing small random subsets of the data, called mini-batches, and using each one to estimate a step. The mini-batch size is usually greater than 1 but less than N (the dataset size). Because each step is much cheaper to compute, the algorithm makes progress much faster.
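
One simple way to form mini-batches is to shuffle the data once per pass and slice it into chunks. The generator below is a sketch under that assumption; the name `minibatches` and the fixed seed are illustrative choices.

```python
import random


def minibatches(data, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover the dataset once."""
    rng = random.Random(seed)
    shuffled = data[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


batches = list(minibatches(list(range(10)), batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.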

Mini-Batch Gradient Descent

Here is a very helpful article on different types of optimizer algorithms
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Learning rate decay is the gradual reduction of the learning rate as a function of time to speed up the learning algorithm. Decaying the learning rate as gradient descent approaches a minimum reduces noise and facilitates a tighter convergence to the target.
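
The card does not name a specific schedule. As one common choice, assumed here purely for illustration, the learning rate can be divided by a factor that grows with the epoch number; `decayed_lr`, `alpha0`, and `decay_rate` are hypothetical names.

```python
def decayed_lr(alpha0, decay_rate, epoch):
    """Inverse-time decay: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)


rates = [decayed_lr(alpha0=0.2, decay_rate=1.0, epoch=e) for e in range(4)]
print(rates)  # starts at 0.2 and shrinks every epoch
```

Other schedules, such as exponential or step-wise decay, follow the same pattern of computing the rate from the epoch index.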

Learning Rate Decay

Gradient descent is a mathematical technique that modifies the parameters of a function to descend from a high value of the function to a low value. It looks at the derivatives of the function with respect to each of its parameters to see which step, via which parameter, is the next best step to minimize the function. Applying gradient descent to the error function helps find weights that achieve lower and lower error values, making the model gradually more accurate.

Gradient Descent

Similar to RMSProp, AdaDelta (Adaptive Delta) was proposed to compensate for the shortcomings of AdaGrad. Like RMSProp, AdaDelta keeps an exponential moving average of the squared gradients (often denoted $G$) instead of their running sum. In addition, instead of simply using a fixed step size $\eta$, it uses an exponential moving average $s$ of the squared parameter updates $\Delta_{\theta}^2$ to scale each step.

$G = \gamma G + (1-\gamma)(\nabla_{\theta}J(\theta_t))^2$
$\Delta_{\theta} =  \frac{\sqrt{s+\epsilon}}{\sqrt{G + \epsilon}} \cdot \nabla_{\theta}J(\theta_t)$
$\theta = \theta - \Delta_{\theta}$
$s = \gamma s + (1-\gamma) \Delta_{\theta}^2$
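
The four equations above translate almost line by line into code. The sketch below does so for a scalar parameter; the function name `adadelta_step` and the default values of `gamma` and `eps` are assumptions for the example.

```python
import math


def adadelta_step(theta, G, s, grad, gamma=0.9, eps=1e-6):
    """One AdaDelta update for a scalar parameter, following the equations above."""
    G = gamma * G + (1 - gamma) * grad ** 2                 # average of squared gradients
    delta = math.sqrt(s + eps) / math.sqrt(G + eps) * grad  # rescaled step
    theta = theta - delta
    s = gamma * s + (1 - gamma) * delta ** 2                # average of squared updates
    return theta, G, s


theta, G, s = adadelta_step(theta=1.0, G=0.0, s=0.0, grad=2.0)
print(theta, G, s)
```

Note that no learning rate appears: the ratio of the two running averages sets the step size.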

AdaDelta (Deep Learning Optimization Algorithm)

One of the big disadvantages of the momentum and Nesterov momentum algorithms is that they rely heavily on the learning rate. AdaGrad is one of the algorithms that adapts the learning rate as training proceeds. The intuition behind the adaptive learning rate is that it takes smaller steps for parameters tied to frequent features and larger steps for parameters tied to features that occur rarely.
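
The card gives no formula, so the sketch below assumes the commonly stated form of AdaGrad: accumulate the sum of squared gradients `G` and divide the step by its square root. The names and the toy setup (one parameter updated four times, another updated once) are illustrative.

```python
import math


def adagrad_step(w, G, grad, alpha=0.1, eps=1e-8):
    """One AdaGrad update for a scalar parameter (commonly used form, assumed here)."""
    G = G + grad ** 2                            # running sum of squared gradients
    w = w - alpha * grad / (math.sqrt(G) + eps)  # step shrinks as G grows
    return w, G


# Parameter tied to a frequent feature: receives a gradient on every step.
w_freq, G_freq = 0.0, 0.0
for _ in range(4):
    before = w_freq
    w_freq, G_freq = adagrad_step(w_freq, G_freq, grad=1.0)
    last_freq_step = before - w_freq

# Parameter tied to a rare feature: receives its first gradient only now.
w_rare, G_rare = adagrad_step(0.0, 0.0, grad=1.0)
rare_step = 0.0 - w_rare

print(last_freq_step, rare_step)  # the rare parameter takes the larger step
```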

AdaGrad (Deep Learning Optimization Algorithm)

In the momentum method, we first move the weights in the direction of the current gradient and then in the direction of the momentum (a weighted sum of all previous steps). In Nesterov momentum, we first move in the direction of the momentum and then calculate the gradient at that new point. We then use this gradient for the update.

The update rules are as follows, where $\alpha$ is the momentum coefficient and $\epsilon$ is the learning rate (note that elsewhere $\alpha$ denotes the learning rate):
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
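
A minimal sketch of these two update rules for a scalar parameter. To avoid clashing with the use of $\alpha$ as learning rate elsewhere, the code calls the momentum coefficient `momentum` and the learning rate `lr`; these names, `grad_fn`, and the toy objective $\theta^2$ are assumptions for the example.

```python
def nesterov_step(theta, v, grad_fn, momentum=0.9, lr=0.1):
    """One Nesterov momentum update; grad_fn returns the gradient at a point."""
    lookahead = theta + momentum * v              # move along the momentum first
    v = momentum * v - lr * grad_fn(lookahead)    # gradient taken at the look-ahead point
    theta = theta + v
    return theta, v


# Hypothetical toy problem: minimize theta**2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn=lambda x: 2 * x)

print(theta)  # close to the minimizer 0
```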

Nesterov momentum (Deep Learning Optimization Algorithm)

- **Local optima**: it's actually unlikely to get stuck in local optima.
- **Cliffs**: on the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact Gradients**: sometimes approximation is needed for gradients.
- **Plateaus**: low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean square propagation).
Like momentum, it smooths the gradient with an exponentially weighted average; like RMSprop, it rescales each step by a running average of the squared gradients.

Adam introduces four hyperparameters:
- learning rate alpha
- beta from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

As mentioned above, you usually do not need to tune beta, beta2, and epsilon as the values listed above will generally work well. Only the learning rate is left to tune in order to accelerate training.
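
The four hyperparameters appear in the sketch below, which uses the commonly stated bias-corrected form of Adam as an assumption (the card itself gives no equations). The names `adam_step`, `m`, and `s` are illustrative; `m` plays the role of the momentum average and `s` the RMSprop average.

```python
import math


def adam_step(w, m, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: average gradient
    s = beta2 * s + (1 - beta2) * grad ** 2  # second moment: average squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for both averages
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(s_hat) + eps)
    return w, m, s


w, m, s = adam_step(w=1.0, m=0.0, s=0.0, grad=2.0, t=1)
print(w)  # the first step has size close to alpha
```

On the first iteration the bias-corrected ratio is close to 1 in magnitude, so the step is roughly `alpha` regardless of the gradient's scale.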


Adam combines the advantages of the AdaGrad and RMSProp optimization algorithms. It considers both the first moment estimate of the gradient (the mean of the gradient) and the second moment estimate (the uncentered variance of the gradient) when calculating the update step size.

Adam optimization algorithm


Adam differs from classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates, and the learning rate does not change during training. Adam combines the advantages of AdaGrad and RMSProp. Like RMSProp, it adapts the per-parameter learning rates based on the average of the second moments of the gradients (the uncentered variance), and it also makes use of the average of the first moment (the mean).

Difference between Adam and SGD

If we only have two features, $x_1$ and $x_2$, in order to minimize the loss function, we can apply gradient descent to update $w_1$, $w_2$, and $b$.
To compute the derivatives of $\mathcal L (a, y)$ with respect to $w_1$, $w_2$, and $b$, we need to compute the derivatives of $\mathcal L (a, y)$ with respect to $a$ and $z$ first.
$$\mathcal L (a, y) = -(y\log(a) + (1 - y)\log(1 - a)) \Rightarrow$$
$$\frac{d\mathcal L (a, y)}{da} = -\frac{y}{a}+\frac{1-y}{1-a}$$
$$a = \sigma(z) = \frac{1}{1 + e^{-z}} \Rightarrow \frac{da}{dz} = a(1-a) \Rightarrow$$
$$\begin{aligned}
\frac{d\mathcal L (a, y)}{dz} & = \frac{d\mathcal L (a, y)}{da}\frac{da}{dz} \\
& = (-\frac{y}{a}+\frac{1-y}{1-a})*(a(1-a)) = a-y \\
\end{aligned}$$
$$\begin{aligned}
\frac{d\mathcal L (a, y)}{dw_1} & = \frac{d\mathcal L (a, y)}{dz}\frac{dz}{dw_1} = (a-y)*x_1 \\ 
\end{aligned}$$
$$\begin{aligned}
\frac{d\mathcal L (a, y)}{dw_2} & = \frac{d\mathcal L (a, y)}{dz}\frac{dz}{dw_2} = (a-y)*x_2 \\ 
\end{aligned}$$
$$\begin{aligned}
\frac{d\mathcal L (a, y)}{db} & = \frac{d\mathcal L (a, y)}{dz}\frac{dz}{db} = (a-y)*1 = (a-y)
\end{aligned}$$
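
The derivation above can be checked numerically by comparing the closed-form gradients with a finite-difference estimate. The helper names and the particular input values are made up for the check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w1, w2, b, x1, x2, y):
    """Cross-entropy loss for one example with two features."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def gradients(w1, w2, b, x1, x2, y):
    """Closed-form gradients from the derivation: dz = a - y."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y
    return dz * x1, dz * x2, dz


# Arbitrary values chosen for the check.
w1, w2, b, x1, x2, y = 0.5, -0.3, 0.1, 1.5, 2.0, 1
dw1, dw2, db = gradients(w1, w2, b, x1, x2, y)

# Central finite difference with respect to w1.
h = 1e-6
numeric_dw1 = (loss(w1 + h, w2, b, x1, x2, y) - loss(w1 - h, w2, b, x1, x2, y)) / (2 * h)
print(dw1, numeric_dw1)  # the two values should agree closely
```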

Logistic regression gradient descent

The gradient $\nabla_x f(x)$ of a scalar function $f(x_1, x_2, x_3, ..., x_n)$ is defined as the unique vector field whose dot product with any vector $v$ at each point $x$ is the directional derivative of $f$ along $v$. That is,
$ \nabla_x f(x) \cdot  v = \nabla_v f(x) $

The directional derivative in direction $v$ (a unit vector) is the slope of the function $f$ in direction $v$, namely the rate of increase of $f$ per unit of distance moved in the direction given by $v$. 

To minimize $f$, we would like to find the direction in which $f$ decreases the fastest. We can do this using the directional derivative:
$\min_{v, v^Tv = 1} \nabla_x f(x) \cdot  v= \min_{v, v^Tv = 1} ||\nabla_x f(x)||_2 ||v||_2 \cos \theta$
where $\theta$ is the angle between $v$ and the gradient. Substituting in $||v||_2= 1$ and ignoring factors that do not depend on $v$, this simplifies to $\min_{v}\cos \theta$.

This is minimized when $v$ points in the opposite direction to the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease $f$ by moving in the direction of the negative gradient.

Hence we have $x' = x-\alpha \nabla_x f(x)$, where $\alpha$ is the learning rate, a positive scalar determining the size of the step.
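
The claim that the negative gradient is the steepest descent direction can be checked by brute force. The sketch below uses a made-up function $f(x_1, x_2) = x_1^2 + 3x_2^2$, evaluates the directional derivative for unit vectors sampled every degree, and compares the best one with the normalized negative gradient.

```python
import math


def grad_f(x1, x2):
    """Gradient of the hypothetical function f(x1, x2) = x1**2 + 3 * x2**2."""
    return 2 * x1, 6 * x2


g = grad_f(1.0, 1.0)
norm = math.hypot(g[0], g[1])

# Directional derivative (gradient dot v) for unit vectors v, one per degree.
candidates = []
for k in range(360):
    angle = math.radians(k)
    v = (math.cos(angle), math.sin(angle))
    candidates.append((g[0] * v[0] + g[1] * v[1], v))

slope, steepest = min(candidates)  # most negative directional derivative
neg_grad_dir = (-g[0] / norm, -g[1] / norm)
print(steepest, neg_grad_dir)  # nearly the same direction
```

The small mismatch comes only from sampling directions at one-degree resolution.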

Derivation of the Gradient Descent Formula

An epoch is one full pass of gradient descent through the entire training set.

Epoch in Gradient Descent

Batch gradient descent (batch size = N) takes relatively low-noise, relatively large steps, so you can just keep marching toward the minimum. However, each step may take a long time to process and needs additional memory.

Stochastic gradient descent (batch size = 1) is easy to fit in memory and efficient for large datasets. But it can be extremely noisy, since a single training example may happen to point in a bad direction and send the step the wrong way. It won't ever converge; it will always oscillate and wander around the region of the minimum.

In practice, mini-batch gradient descent with a batch size between 1 and N works better. It is not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum.

Batch vs Stochastic vs Mini-Batch Gradient Descent

For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$. Which of these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$?

Could adding polynomial features (e.g., instead using $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_1x_2+\theta_5x_2^2)$) increase how well we can fit the training data?

Suppose you have the following training set, and fit a logistic regression classifier $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$.

The objective of backpropagation is to change the weights for the neurons, in order to bring the error function to a minimum with the help of gradient descent. 

Backpropagation calculates how much the final output values are affected by each of the weights. To do this, it calculates partial derivatives, going back from the error function to the neuron that carried a specific weight.
