Learn Before
Mini-Batches Size
If your training set is small (m < 2,000 examples), it is better to use batch gradient descent. Otherwise, make sure that every mini-batch fits in your CPU/GPU memory. It is common practice to use powers of two as the mini-batch size: 64, 128, 256. This is related to the fact that the number of physical processors on a GPU tends to be a power of two, so such sizes map well onto the hardware. If the mini-batch size is too small, the loss curve will oscillate, which affects the stability of training.
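As a rough illustration of how a training set is split into mini-batches of a power-of-two size, here is a minimal NumPy sketch (the function name `create_mini_batches` and the array shapes are assumptions for this example, not code from this card):

```python
import numpy as np

def create_mini_batches(X, y, batch_size=64, seed=0):
    """Shuffle the training set and split it into mini-batches.

    X: array of shape (m, n_features), y: array of shape (m,).
    batch_size: typically a power of two (64, 128, 256).
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    permutation = rng.permutation(m)
    X_shuffled, y_shuffled = X[permutation], y[permutation]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size  # last batch may be smaller if m is not divisible
        mini_batches.append((X_shuffled[start:end], y_shuffled[start:end]))
    return mini_batches

# Example: m = 2,048 examples with batch_size = 64 gives 2,048 / 64 = 32 mini-batches.
X = np.random.randn(2048, 10)
y = np.random.randn(2048)
batches = create_mini_batches(X, y, batch_size=64)
print(len(batches))  # 32
```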
Tags
Data Science
Related
An Example of Mini-Batches
Mini-Batch Gradient Descent Algorithm
Batch vs Stochastic vs Mini-Batch Gradient Descent
Example Using Mini-Batch Gradient Descent (Learning Rate Decay)
Mini-Batches Size
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:
Stochastic Gradient Descent Algorithm
Loss Gradient over a Mini-batch