For a finite sample of size $$n$$, the empirical data distribution is modeled as the discrete probability distribution $$p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$$, where $$\delta$$ denotes the Dirac delta function, so each observed pair $$(x_i, y_i)$$ carries probability mass $$1/n$$. This distribution justifies performing stochastic gradient descent over a finite dataset: drawing independent samples $$(x_i, y_i)$$ from it is the same as picking training examples uniformly at random with replacement.
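
The sampling step can be sketched in plain Python. The toy dataset and the helper name `sample_empirical` are made up for illustration; the frequency check at the end only confirms that each pair shows up with relative frequency near $$1/n$$.

```python
import random
from collections import Counter

# Toy dataset (made up for illustration): n = 4 pairs (x_i, y_i).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def sample_empirical(data, k, rng):
    """Draw k independent samples from the empirical distribution.

    Each draw picks an index uniformly at random, with replacement,
    so every pair (x_i, y_i) has probability 1/n.
    """
    return [data[rng.randrange(len(data))] for _ in range(k)]

rng = random.Random(0)
draws = sample_empirical(data, 10_000, rng)

# Relative frequency of each pair; each should be close to 1/n = 0.25.
freqs = {pair: count / len(draws) for pair, count in Counter(draws).items()}
print(freqs)
```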

Setting the mini-batch size to 1 gives the algorithm called stochastic gradient descent (SGD).

In this case, every iteration takes a gradient descent step using just a single training example $$(x^{(i)}, y^{(i)})$$, where $$\alpha$$ is the learning rate:
$$w \leftarrow w - \alpha \nabla_w J(x^{(i)}, y^{(i)}; w)$$
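
As a rough illustration of this update, the sketch below fits a one-parameter model $$y = w x$$ with a squared-error loss. The loss, the toy data, and the function name `sgd_step` are assumptions chosen for the example, not part of the source.

```python
import random

def sgd_step(w, x_i, y_i, alpha):
    """One SGD update for the loss J = 0.5 * (w * x_i - y_i) ** 2."""
    grad = (w * x_i - y_i) * x_i      # dJ/dw for this single example
    return w - alpha * grad

# Toy data generated from y = 2x (an assumption for illustration).
xs = [1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]

rng = random.Random(0)
w = 0.0
for _ in range(200):
    i = rng.randrange(len(xs))        # pick one training example at random
    w = sgd_step(w, xs[i], ys[i], alpha=0.1)

print(w)  # close to the true slope 2.0
```

Each call to `sgd_step` touches exactly one example, so its cost is the same whether the dataset holds three points or three million.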

The most important property of SGD is that computation time per step does not grow with the number of examples. This makes SGD very efficient with large training sets.

The learning rate $$\alpha$$ is a hyperparameter that must be tuned. Unlike the regular parameters of a model (weights such as $$w$$ and $$b$$), which the algorithm learns from the training set, hyperparameters are chosen by the algorithm designer and affect how the algorithm itself behaves.
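
To see why the learning rate needs tuning, consider gradient descent on the toy objective $$f(w) = w^2$$ (an assumption chosen because the effect can be checked by hand). Each step multiplies $$w$$ by $$1 - 2\alpha$$, so a small rate shrinks $$w$$ while a rate above 1 makes it grow. The helper name `run_gd` is hypothetical.

```python
def run_gd(lr, steps=50, w0=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = run_gd(lr=0.1)   # each step scales w by 0.8, so w shrinks toward 0
large = run_gd(lr=1.1)   # each step scales w by -1.2, so |w| grows
print(small, large)
```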

Stochastic Gradient Descent Algorithm

Dive into Deep Learning

- Adam usually converges fast, but tends to overfit
- SGD converges slowly, but often gives great final results
- RMSProp sometimes works best
- SWA (stochastic weight averaging) can easily improve quality
- AdaTune adjusts the learning rate automatically during training

Adam vs. SGD vs. RMSProp vs. SWA vs. AdaTune

Finite Sample Distribution for Stochastic Gradient Descent

Optimality guarantees for stochastic gradient descent are generally unavailable when dealing with nonconvex objectives. In such cases, the number of local minima that would require checking to confirm a global optimum could be exponentially large, making theoretical guarantees intractable.
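
A one-dimensional sketch shows the difficulty. The objective $$f(w) = w^4 - 3w^2 + w$$ is an assumption picked for illustration; it has two local minima. Plain (noise-free) gradient descent stands in for SGD here so the runs are reproducible, and the names `descend`, `w_left`, and `w_right` are hypothetical.

```python
def f(w):
    return w**4 - 3 * w**2 + w

def grad_f(w):
    return 4 * w**3 - 6 * w + 1

def descend(w0, lr=0.01, steps=1000):
    """Run gradient descent from the starting point w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

w_left = descend(-2.0)    # start left of both minima
w_right = descend(2.0)    # start right of both minima
print(w_left, f(w_left))
print(w_right, f(w_right))
```

The left start reaches the minimum near $$w \approx -1.30$$, while the right start settles near $$w \approx 1.13$$, where the objective is higher. Nothing in the second run signals that a better minimum exists elsewhere.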

Lack of Optimality Guarantees in Nonconvex Optimization

A minimal from-scratch implementation of the stochastic gradient descent optimizer defines a function sgd(params, states, hyperparams) that accepts three arguments: a list of model parameters, optimizer states (unused for vanilla SGD), and a dictionary of hyperparameters. For each parameter tensor, the function subtracts the product of the learning rate and the parameter's gradient using an in-place operation, then zeroes the gradient. In PyTorch:

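```python
def sgd(params, states, hyperparams):
    # states is unused here; vanilla SGD keeps no auxiliary variables
    for p in params:
        # in-place update: p <- p - lr * grad
        p.data.sub_(hyperparams['lr'] * p.grad)
        # reset the gradient before the next backward pass
        p.grad.data.zero_()
```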

This function signature, which takes params, states, and hyperparams, is deliberately general: more advanced optimizers introduced later (e.g., momentum, Adam) can share the same calling convention and use the states argument to maintain auxiliary variables.
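
The shared calling convention can be sketched without PyTorch. In the example below, `Param` is a hypothetical scalar stand-in for a tensor, `train` is a made-up driver that minimizes $$(w - 3)^2$$, and `sgd_momentum` is one plausible way a momentum optimizer could use `states` (one velocity per parameter). None of these is taken verbatim from the source.

```python
class Param:
    """Hypothetical scalar stand-in for a tensor with .data and .grad."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

def sgd(params, states, hyperparams):
    # Vanilla SGD ignores `states`.
    for p in params:
        p.data -= hyperparams['lr'] * p.grad
        p.grad = 0.0

def sgd_momentum(params, states, hyperparams):
    # `states` holds one velocity per parameter and is updated in place.
    for i, p in enumerate(params):
        states[i] = hyperparams['momentum'] * states[i] + p.grad
        p.data -= hyperparams['lr'] * states[i]
        p.grad = 0.0

def train(optimizer, states, hyperparams, steps=300):
    # Minimize f(w) = (w - 3)**2 using a hand-written gradient.
    params = [Param(0.0)]
    for _ in range(steps):
        params[0].grad = 2 * (params[0].data - 3.0)
        optimizer(params, states, hyperparams)
    return params[0].data

w_sgd = train(sgd, None, {'lr': 0.1})
w_mom = train(sgd_momentum, [0.0], {'lr': 0.02, 'momentum': 0.9})
print(w_sgd, w_mom)  # both close to the minimizer 3.0
```

Because both optimizers accept the same three arguments, the training loop does not need to know which one it is calling.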

SGD Optimizer From-Scratch Implementation

- Batch gradient descent uses the entire dataset of size $$N$$ as a single batch. It produces low-noise gradient estimates and takes large, reliable steps toward the minimum, but requires considerable time per iteration and significant memory.
- Stochastic gradient descent (SGD) uses a batch size of 1. It is memory-efficient, but extremely noisy because individual examples may point in poor directions, causing SGD to oscillate rather than converge directly.
- Minibatch gradient descent uses a batch size between 1 and $$N$$. It offers a practical compromise by balancing convergence speed and computational efficiency.

Although SGD converges faster than batch gradient descent in terms of examples processed, computing the gradient example-by-example is computationally inefficient. Minibatch gradient descent leverages hardware optimization (such as vectorization), allowing intermediate batch sizes (e.g., 100) to often outperform both extremes in overall wall-clock runtime.
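
The three variants differ only in the batch size, which a single function can expose as a parameter. The sketch below is an illustration under stated assumptions: the model $$y = w x$$, the toy data, and the name `minibatch_gd` are made up, and plain Python has no vectorization, so the example shows update counts and convergence rather than wall-clock speed.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr, epochs, seed=0):
    """Fit y = w * x by least squares using minibatches of the given size."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    updates = 0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)                      # new random order each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Average gradient of 0.5 * (w * x - y)**2 over the batch.
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates

# Toy data from y = 2x plus small noise (made up for illustration).
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

results = {}
for name, bs in [("batch", 100), ("minibatch", 10), ("sgd", 1)]:
    results[name] = minibatch_gd(xs, ys, batch_size=bs, lr=0.5, epochs=20)
    w, updates = results[name]
    print(f"{name:9s} batch_size={bs:3d} updates={updates:4d} w={w:.3f}")
```

With the same 20 passes over the data, the batch variant makes 20 updates, the minibatch variant 200, and SGD 2000, which is the sense in which smaller batches make more progress per example processed.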
