
Nesterov algorithm formula

$step_{t} = \beta \, step_{t-1} + \alpha \nabla J(W^{t-1} - \beta \, step_{t-1})$

$W^{t} = W^{t-1} - step_{t}$

$\nabla J(W^{t-1} - \beta \, step_{t-1})$ - the gradient of the cost function $J$, evaluated at the look-ahead point $W^{t-1} - \beta \, step_{t-1}$

$step_{t}$ - the step at time stamp $t$

$W^{t}$ - the parameters of the layer at time stamp $t$

$\alpha$ - the learning rate

$\beta$ - another hyperparameter, the momentum coefficient (most people use 0.9)

The same update is also applied to the bias parameters.
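The update above can be sketched in NumPy. This is a minimal illustration, not a production optimizer; the toy quadratic cost `J(W) = 0.5 * ||W||^2` (whose gradient is simply `W`) and the function names are assumptions for the example.

```python
import numpy as np

def grad_J(W):
    # Gradient of the assumed toy cost J(W) = 0.5 * ||W||^2.
    return W

def nesterov_update(W, step, alpha=0.1, beta=0.9):
    # The gradient is evaluated at the look-ahead point W - beta * step,
    # which is what distinguishes Nesterov from plain momentum.
    lookahead = W - beta * step
    new_step = beta * step + alpha * grad_J(lookahead)
    return W - new_step, new_step

W = np.array([1.0, -2.0])
step = np.zeros_like(W)
for _ in range(100):
    W, step = nesterov_update(W, step)
print(W)  # moves toward the minimum of J at the origin
```

The same two lines would be applied to the bias vector as well, with its own `step` buffer.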


Updated 2020-11-16

Tags

Deep Learning (in Machine learning)

Data Science