Learn Before
  • Deep Learning Optimizer Algorithms

Difference between Adam and SGD

Adam differs from classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates, and that learning rate does not change during training. Adam combines the advantages of AdaGrad and RMSProp: rather than adapting per-parameter learning rates from the average of recent squared gradients alone, as in RMSProp, it uses estimates of both the first moment of the gradients (the mean) and the second moment (the uncentered variance).
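To make the contrast concrete, below is a minimal NumPy sketch of the two update rules. The class and function names are illustrative, and the hyperparameter defaults (beta1=0.9, beta2=0.999, eps=1e-8) follow the values commonly suggested for Adam; this is a sketch of the idea, not a production implementation.

    import numpy as np

    def sgd_update(w, grad, lr=0.01):
        # Classical SGD: one fixed learning rate shared by every weight.
        return w - lr * grad

    class AdamSketch:
        # Adam: per-parameter step sizes from first/second moment estimates.
        def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
            self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
            self.m = None   # first moment estimate (mean of gradients)
            self.v = None   # second moment estimate (uncentered variance)
            self.t = 0      # timestep, used for bias correction

        def update(self, w, grad):
            if self.m is None:
                self.m = np.zeros_like(w)
                self.v = np.zeros_like(w)
            self.t += 1
            self.m = self.beta1 * self.m + (1 - self.beta1) * grad
            self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
            m_hat = self.m / (1 - self.beta1 ** self.t)   # bias-corrected mean
            v_hat = self.v / (1 - self.beta2 ** self.t)   # bias-corrected variance
            return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

Note how SGD applies the same step size to every weight, while Adam divides each step by a running estimate of the gradient's magnitude, so parameters with consistently large gradients take smaller effective steps.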


Tags

Data Science

Related
  • Mini-Batch Gradient Descent

  • Gradient Descent with Momentum

  • An overview of gradient descent optimization algorithms

  • Learning Rate Decay

  • Gradient Descent

  • AdaDelta (Deep Learning Optimization Algorithm)

  • Adam (Deep Learning Optimization Algorithm)

  • RMSprop (Deep Learning Optimization Algorithm)

  • AdaGrad (Deep Learning Optimization Algorithm)

  • Nesterov momentum (Deep Learning Optimization Algorithm)

  • Challenges with Deep Learning Optimizer Algorithms

  • Adam optimization algorithm

  • Difference between Adam and SGD