Learn Before
Stochastic Gradient Descent Algorithm
If we choose the mini-batch size to be 1, we get an algorithm called Stochastic Gradient Descent (SGD).
In this case, on every iteration you take a gradient descent step using just a single training example.
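As a rough sketch of that single-example step (using α for the learning rate discussed below, and the superscript notation x⁽ⁱ⁾, y⁽ⁱ⁾ for the i-th training example, neither of which appears in the original text):

$$
w := w - \alpha \, \nabla_w J\!\left(w;\, x^{(i)}, y^{(i)}\right)
$$

where J is the cost computed on that one randomly chosen example.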
The most important property of SGD is that the computation time per step does not grow with the number of training examples. This makes SGD very efficient on large training sets.
The learning rate is a hyperparameter that must be tuned. Unlike the regular parameters of a model (weights such as w and b), which are learned by the algorithm from the training set, hyperparameters are set by the algorithm designer and control how the learning algorithm behaves.
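To make the loop concrete, here is a minimal SGD sketch for linear regression in Python. The function name, the synthetic data, and the specific learning_rate and epochs values are illustrative assumptions, not part of the original material.

```python
# Minimal SGD sketch for linear regression (illustrative only; the names
# sgd_linear_regression, learning_rate, and epochs are assumptions).
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=10):
    """Fit y ~ X @ w + b with stochastic gradient descent (mini-batch size 1)."""
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Each step uses a single training example, so its cost does not
        # depend on the total number of examples.
        for i in np.random.permutation(n_examples):
            prediction = X[i] @ w + b
            error = prediction - y[i]          # gradient of the squared-error loss
            w -= learning_rate * error * X[i]  # update weights
            b -= learning_rate * error         # update bias
    return w, b

# Usage on synthetic data
X = np.random.randn(1000, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0
w, b = sgd_linear_regression(X, y, learning_rate=0.05, epochs=20)
```

Note that each inner-loop update touches only one example, which is why the per-step cost stays constant as the training set grows.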
Tags
Data Science
Related
An Example of Mini-Batches
Mini-Batch Gradient Descent Algorithm
Batch vs Stochastic vs Mini-Batch Gradient Descent
Example Using Mini-Batch Gradient Descent (Learning Rate Decay)
Mini-Batches Size
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:
Stochastic Gradient Descent Algorithm
Loss Gradient over a Mini-batch