Batch Gradient Descent must process the entire dataset to complete a single step. Because gradient descent can require many steps, this procedure becomes very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by drawing small random groups of data points, called mini-batches, and using each one to estimate the gradient for a step. The mini-batch size is usually greater than 1 but less than N (the dataset size). Because each step is much cheaper to compute, the algorithm makes progress far faster than the full-batch version.
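As a minimal sketch of the idea, the following fits a one-parameter model `y = w * x` with mini-batches. The function name `minibatch_gd`, the toy data, and the default settings are illustrative assumptions, not part of the original notes.

```python
import random

def minibatch_gd(xs, ys, batch_size=4, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

# Toy data (assumed for illustration): y = 2x without noise.
xs = [0.1 * i for i in range(20)]
ys = [2.0 * x for x in xs]
w_fit = minibatch_gd(xs, ys)
```

Each update touches only `batch_size` points, so the cost of a step does not grow with the size of the dataset.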

Mini-Batch Gradient Descent

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients, and then use that average, rather than the raw gradient, to update your weights. It almost always works faster than the standard gradient descent algorithm.
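A minimal sketch of this update for a single scalar weight is shown here. The function name `momentum_descent`, the quadratic test function, and the default values of `lr` and `beta` are assumptions chosen for illustration.

```python
def momentum_descent(grad_fn, w, lr=0.1, beta=0.9, steps=500):
    """Gradient descent that steps along a running average of gradients."""
    v = 0.0  # exponentially weighted average of past gradients
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1 - beta) * g  # update the running average
        w -= lr * v                    # step along the average, not the raw gradient
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = momentum_descent(lambda w: 2 * (w - 3.0), w=0.0)
```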


Gradient Descent with Momentum

Here is a very helpful article on different types of optimizer algorithms:
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Learning rate decay is the gradual reduction of the learning rate as a function of time to speed up the learning algorithm. Decaying the learning rate as the gradient descent approaches completion reduces noise and facilitates a tighter convergence to a target.
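One common schedule, used here purely as an assumed example, divides the initial learning rate by `1 + decay_rate * epoch`. The function name `decayed_lr` and the numbers are hypothetical.

```python
def decayed_lr(lr0, decay_rate, epoch):
    """Inverse-time decay: the learning rate shrinks as the epoch count grows."""
    return lr0 / (1.0 + decay_rate * epoch)

# Learning rates for the first four epochs, starting from 0.2.
schedule = [decayed_lr(0.2, 1.0, epoch) for epoch in range(4)]
```

Other schedules, such as exponential or step-wise decay, follow the same pattern of computing the rate from the epoch number.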

Learning Rate Decay

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.
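The loop below is a minimal sketch of plain gradient descent on a two-parameter quadratic. The names `gradient_descent` and `grad_bowl`, the learning rate, and the test function are illustrative assumptions.

```python
def gradient_descent(grad_fn, params, lr=0.05, steps=500):
    """Repeatedly move the parameters against the gradient."""
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

def grad_bowl(p):
    """Gradient of f(x, y) = (x - 1)^2 + 2 * (y + 2)^2."""
    x, y = p
    return [2 * (x - 1), 4 * (y + 2)]

result = gradient_descent(grad_bowl, [0.0, 0.0])
```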

Gradient Descent

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum and RMSProp, bringing the benefits of both: faster convergence from momentum and an adaptive learning rate from RMSProp.

Adam (Deep Learning Optimization Algorithm)

- RMSProp stands for Root Mean Square Propagation.
- RMSProp is an optimization algorithm closely related to AdaGrad, as both employ the square of the gradient to scale the update coefficients on a per-coordinate basis. However, RMSProp overcomes AdaGrad's tendency for radically diminishing learning rates by using a leaky (exponentially weighted) average of squared gradients rather than a cumulative sum.
- RMSProp also shares the leaky averaging mechanism with the momentum method, but applies it differently: whereas momentum uses leaky averaging to smooth the gradient direction, RMSProp uses the technique to adjust the coefficient-wise preconditioner that rescales the learning rate independently for each parameter.
- Because RMSProp does not automatically schedule the learning rate (unlike AdaGrad, whose learning rate decays implicitly through accumulation), the practitioner must schedule the learning rate explicitly.
- The decay coefficient $$\gamma$$ governs how long the gradient history is retained when adjusting the per-coordinate scale: a larger $$\gamma$$ produces a longer memory, while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients.
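The points above can be sketched as a single update step. This is a minimal illustration that assumes the common form in which each gradient coordinate is divided by the square root of its leaky average. The function name `rmsprop_step` and the default values are hypothetical.

```python
import math

def rmsprop_step(w, s, grad, lr=0.01, gamma=0.9, eps=1e-6):
    """One RMSProp update; s holds the leaky average of squared gradients."""
    new_w, new_s = [], []
    for w_i, s_i, g_i in zip(w, s, grad):
        s_i = gamma * s_i + (1 - gamma) * g_i * g_i  # leaky average, not a running sum
        new_s.append(s_i)
        new_w.append(w_i - lr * g_i / math.sqrt(s_i + eps))  # per-coordinate rescaling
    return new_w, new_s

# Two coordinates whose gradients differ by a factor of 20.
w, s = [0.0, 0.0], [0.0, 0.0]
grad = [1.0, 20.0]
w_next, s = rmsprop_step(w, s, grad)
step_sizes = [a - b for a, b in zip(w, w_next)]
```

Despite the very different gradient magnitudes, the two coordinates move by almost the same amount on this first step, which is the per-coordinate scaling at work.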

RMSprop (Deep Learning Optimization Algorithm)

In the standard momentum method, we compute the gradient at the current position and combine it with the momentum term (a weighted sum of all previous steps). In Nesterov momentum, we first move in the direction of the momentum and then calculate the gradient at that look-ahead point. We then use this gradient to correct the step.

The update rules are as follows:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$
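The update rules above can be sketched for a scalar parameter as follows. Here `alpha` is the momentum coefficient and `lr` plays the role of the step size in the formula. The function name `nesterov_descent`, the test function, and the default values are assumptions for illustration.

```python
def nesterov_descent(grad_fn, theta, lr=0.1, alpha=0.9, steps=300):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    v = 0.0
    for _ in range(steps):
        lookahead = theta + alpha * v              # move along the momentum first
        v = alpha * v - lr * grad_fn(lookahead)    # gradient taken at the new point
        theta = theta + v
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = nesterov_descent(lambda t: 2 * (t - 3.0), theta=0.0)
```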

Nesterov momentum (Deep Learning Optimization Algorithm)

- **Local optima**: it's actually unlikely to get stuck in local optima.
- **Cliffs**: on the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far.
- **Inexact Gradients**: sometimes approximation is needed for gradients
- **Plateaus**: low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

Adam stands for adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean square propagation).
Like momentum, it smooths the gradient with an exponentially weighted average; like RMSprop, it rescales each update by a running average of the squared gradients.

Adam introduces four hyperparameters:
- learning rate alpha
- beta1 from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

You usually do not need to tune beta1, beta2, and epsilon, as the values listed above will generally work well. Only the learning rate is left to tune in order to accelerate training.
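A minimal sketch of one Adam step for a scalar weight is given here, using the default values listed above. It assumes the standard bias-corrected form of the algorithm. The function name `adam_step` is hypothetical, and `beta1` denotes the beta that comes from momentum.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # RMSprop-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step on f(w) = w^2 starting from w = 5 (gradient 2 * w = 10).
w, m, v = 5.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, grad=2 * w, t=1)
```

On the first step the update has a magnitude of roughly `lr` regardless of the size of the gradient, which is one reason the learning rate is the main value left to tune.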


Adam combines the advantages of two optimization algorithms, AdaGrad and RMSProp. It considers both the first moment estimate of the gradient (the mean of the gradient) and the second moment estimate (the uncentered variance of the gradient) when calculating the update step size.

Adam optimization algorithm


Adam differs from classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates, and that learning rate does not adapt during training. Adam combines the advantages of AdaGrad and RMSProp: it adapts the per-parameter learning rates based on the average of the second moments of the gradients (the uncentered variance), as in RMSProp, and it also makes use of the average first moment (the mean).

Difference between Adam and SGD

The Adagrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing simple feature occurrence counters with an aggregate of the squares of previously observed gradients. Specifically, it uses $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ to adjust the learning rate. This automatically scales down the step size significantly for coordinates that frequently have large gradients, while applying a gentler treatment to coordinates with small gradients, thereby eliminating the need to manually decide when a gradient is considered large enough.
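The accumulation rule above can be sketched for a single coordinate as follows. The function name `adagrad_step`, the constant gradient, and the default values are assumptions made for illustration.

```python
import math

def adagrad_step(w, s, grad, lr=0.1, eps=1e-8):
    """One Adagrad update; s is the running sum of squared gradients."""
    s = s + grad * grad                      # s(i, t+1) = s(i, t) + (gradient)^2
    w = w - lr * grad / math.sqrt(s + eps)   # step shrinks as s grows
    return w, s

# Feed the same gradient three times and record how far each step moves.
w, s = 0.0, 0.0
step_sizes = []
for _ in range(3):
    w_next, s = adagrad_step(w, s, grad=1.0)
    step_sizes.append(w - w_next)
    w = w_next
```

Because the sum never decreases, the steps get smaller on every iteration even though the gradient stays the same.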

Adagrad

Adadelta is an optimization algorithm that has no explicit learning rate parameter. Instead, it uses the rate of change in the parameters themselves to dynamically adapt the learning rate. To accomplish this, the algorithm utilizes two specific state variables: $$\mathbf{s}_t$$ to track a leaky average of the second moment of the gradient, and $$\Delta\mathbf{x}_t$$ to track a leaky average of the second moment of the model's parameter changes. The algorithm retains standard naming conventions for these variables to maintain consistency with similar optimization methods like momentum, AdaGrad, and RMSProp.
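A minimal sketch of one Adadelta step for a scalar parameter is shown below, assuming the usual formulation in which the gradient is rescaled by the ratio of the two leaky averages. The function name `adadelta_step` and the values of `rho` and `eps` are illustrative assumptions.

```python
import math

def adadelta_step(x, s, delta, grad, rho=0.9, eps=1e-5):
    """One Adadelta update with no explicit learning rate."""
    s = rho * s + (1 - rho) * grad * grad  # leaky average of squared gradients
    # Rescale the gradient by the ratio of past parameter changes to past gradients.
    rescaled = math.sqrt(delta + eps) / math.sqrt(s + eps) * grad
    x = x - rescaled
    delta = rho * delta + (1 - rho) * rescaled * rescaled  # leaky average of squared changes
    return x, s, delta

# First step on f(x) = x^2 starting from x = 1 (gradient 2 * x = 2).
x, s, delta = 1.0, 0.0, 0.0
x, s, delta = adadelta_step(x, s, delta, grad=2 * x)
```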

Adadelta

When a neural network trains, it uses an algorithm called an optimizer to determine the optimal weights for the model.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

University of Michigan - Ann Arbor

University of California, Berkeley

Claude

Hyperparameters related to neural network structure:
- Number of hidden layers (Depth)
- Number of hidden units (Width)
- Dropout method
- Activation function for each layer
- Weights Initialization

Hyperparameters related to training algorithm:
- Learning rate $\alpha$
- Momentum parameter $\beta \sim 0.9$
- Adam parameters $\beta_1 \sim 0.9, \beta_2 \sim 0.999, \epsilon \sim 10^{-8}$
- Number of Gradient descent iterations
- Mini-batch size 
- Optimizer algorithm
- Learning rate decay
- Regularization rate $\lambda$ 

List of Common Hyperparameters in Deep Learning

A helpful website that introduces neural networks:
https://missinglink.ai/guides/neural-network-concepts/

Neural Network Reference

https://www.coursera.org/learn/deep-neural-network?specialization=deep-learning

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Dive into Deep Learning

For a feedforward neural network, the depth of the network is the number of hidden layers plus one (as the output layer is also parameterized). The width of the network is the dimensionality of its hidden layers.

The main architectural considerations for designing a neural network are choosing the depth of the network and the width of each layer.


Depth and Width for Neural Networks

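Dropout randomly “kills” (sets to zero) a fraction of the neurons during training to prevent overfitting. The dropout rate is the hyperparameter that sets this fraction, and a new random set of neurons is dropped on each training pass.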
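The following is a minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted" variant, which rescales the surviving activations so their expected value is unchanged. The function name `dropout` and its parameters are hypothetical.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # no neurons are dropped at prediction time
    keep = 1.0 - rate
    # Survivors are divided by the keep probability (inverted dropout, an assumption).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=rng)
kept = dropout(activations, rate=0.5, rng=rng, training=False)
```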

Dropout

The learning rate of a neural network controls the size of each gradient descent step taken during training. A larger learning rate makes the network train faster but might cause it to overshoot the minimum of the loss function.

Neural Network Learning Rate

When you train on a huge amount of data with limited computer memory, you can split the whole training set into several batches that each fit into memory. You then feed these batches to your model one by one. After feeding every batch once, you have completed one epoch. To successfully train your model, you need multiple epochs.
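As a small worked example of the relationship between batches and epochs, the sketch below counts parameter updates. The sample count, batch size, and function name `updates_per_epoch` are assumed for illustration.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    # The last batch may be smaller than the others, so round up.
    return math.ceil(num_samples / batch_size)

num_samples, batch_size, epochs = 1000, 128, 10
per_epoch = updates_per_epoch(num_samples, batch_size)
total_updates = per_epoch * epochs
```

With these numbers, one epoch consists of 8 batches, so 10 epochs amount to 80 updates.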

Epochs in Machine Learning

In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each connection into a neuron has a weight; the neuron multiplies each input by its weight and sums the results to produce the value that is transferred to the next layer.
The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. Some activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1. When used in a multilayer neural network, activation functions can be different for different layers.
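A few widely used activation functions are sketched below to make the ranges concrete. The selection (step, sigmoid, tanh, ReLU) and the sample inputs are illustrative choices rather than a list taken from the notes.

```python
import math

def step(z, threshold=0.0):
    """Turns the neuron fully on or off around a threshold."""
    return 1.0 if z > threshold else 0.0

def sigmoid(z):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive inputs through and zeroes out negative ones."""
    return max(0.0, z)

inputs = [-2.0, 0.0, 2.0]
outputs = {
    "step": [step(z) for z in inputs],
    "sigmoid": [sigmoid(z) for z in inputs],
    "tanh": [math.tanh(z) for z in inputs],  # squashes into (-1, 1)
    "relu": [relu(z) for z in inputs],
}
```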

Activation Functions in Neural Networks

Deep Learning Optimizer Algorithms

Initial weights are applied to all the neurons.
It is necessary to set initial weights for the first forward pass. Two basic options are to set the weights to zero or to randomize them. However, these options can result in a vanishing or exploding gradient, which will make it difficult to train the model. To mitigate this problem, you can use a heuristic (a formula tied to the number of units in each layer) to determine the weights. A common heuristic used for the tanh activation is called Xavier initialization.
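A minimal sketch of Xavier initialization for one layer follows. It assumes the uniform variant, in which weights are drawn from a range set by the number of inputs (`fan_in`) and outputs (`fan_out`) of the layer. The function name `xavier_uniform` and the layer sizes are hypothetical.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix with Xavier (Glorot) uniform scaling."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))  # wider layers get smaller weights
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)
weights = xavier_uniform(fan_in=4, fan_out=3, rng=rng)
```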

In addition, if the architecture being used has already been trained by someone else (for example, on a dataset such as ImageNet), the weights can be initialized to the pretrained values. This is known as transfer learning, and it can substantially increase training speed and model accuracy.

Deep Learning Weight Initialization

- Grid search
- Random search (randomly choosing values)
- Coarse to fine
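The first two methods in the list can be sketched as follows. The candidate values, the log-scale sampling of the learning rate, and the names `grid` and `sample_random_config` are assumptions for illustration. Coarse-to-fine search would repeat either method over a narrower range around the best result and is not shown.

```python
import itertools
import random

learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64, 128]

# Grid search: try every combination of the candidate values.
grid = list(itertools.product(learning_rates, batch_sizes))

def sample_random_config(rng):
    """Random search: draw each hyperparameter independently."""
    log_lr = rng.uniform(-4, -2)  # sample the exponent so small rates are not neglected
    return 10 ** log_lr, rng.choice(batch_sizes)

rng = random.Random(0)
random_configs = [sample_random_config(rng) for _ in range(9)]
```

Both approaches evaluate nine configurations here, but random search tries nine distinct learning rates while the grid tries only three.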

Hyperparameters Tuning Methods in Deep Learning

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data. It is required by the model when making predictions and estimated or learned from data. It is often not set manually by the practitioner. E.g. The coefficients in a linear regression or logistic regression; the weights in an artificial neural network, etc.

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. It is often used in processes to help estimate model parameters. It is often specified by the practitioner. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error. E.g. The learning rate for training a neural network; the K in K-nearest neighbors, etc.

Difference between Model Parameter and Model Hyperparameter

The trade-off between the standard prediction loss and the additive weight decay penalty is characterized by the regularization constant, $$\lambda$$. This nonnegative hyperparameter is fit using validation data and modifies the objective to $$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$$. When $$\lambda = 0$$, the original loss function is recovered. For $$\lambda > 0$$, the size of the weights is restricted, with larger values of $$\lambda$$ constraining the weights more considerably. The penalty term is divided by $$2$$ by convention so that the constant cancels out gracefully when the derivative of the quadratic function is taken.
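The penalty and its effect on the gradient can be sketched numerically. The function names `l2_penalty` and `penalized_gradient`, the weights, and the loss gradient are made-up values for illustration.

```python
def l2_penalty(weights, lam):
    """Weight decay term (lambda / 2) * ||w||^2."""
    return 0.5 * lam * sum(w * w for w in weights)

def penalized_gradient(loss_grad, weights, lam):
    """Gradient of the penalized objective: the factor 1/2 cancels, leaving lambda * w."""
    return [g + lam * w for g, w in zip(loss_grad, weights)]

weights = [3.0, 4.0]
penalty = l2_penalty(weights, lam=0.1)
grad = penalized_gradient([0.5, -0.5], weights, lam=0.1)
```

Setting `lam=0` returns a zero penalty and the unmodified loss gradient, which recovers the original objective.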

Regularization Constant

Batch normalization is a technique designed to accelerate and stabilize the training of deep neural networks. Mechanistically, it centers and rescales the intermediate layer activations back to a controlled mean and variance, preventing their distributions from diverging across layers and over time. By keeping these intermediate values on a comparable scale, batch normalization enables the use of more aggressive learning rates. The technique was originally motivated by the concept of covariate shift applied to internal layers, but the hypothesis that it works by reducing this so-called internal covariate shift has since been challenged and does not appear to be a valid explanation for its effectiveness. Although intuitively thought to make the optimization landscape smoother, the precise mechanism by which batch normalization aids training remains an open research question. Despite this theoretical uncertainty, batch normalization has proven indispensable in practice, being applied in nearly all deployed image classifiers and earning the original paper tens of thousands of citations.
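The centering and rescaling step can be sketched for a single feature over one mini-batch. This is a simplified training-time illustration: it omits the running statistics used at prediction time, and the function name `batch_norm`, the scale `gamma`, the shift `beta`, and the sample values are assumptions.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    # eps guards against division by zero when the batch has no variance.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in values]

activations = [10.0, 12.0, 14.0, 16.0]
normalized = batch_norm(activations)
```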
