University of Michigan - Ann Arbor

Batch Gradient Descent must process the entire dataset to complete a single step. Because Gradient Descent can require many steps, this makes the procedure slow and inefficient for large datasets.

Mini-Batch Gradient Descent addresses this problem by using small, randomly chosen groups of data points, called mini-batches, to estimate each step. A mini-batch usually has a size greater than 1 but less than N (the dataset size). Because each step processes only a fraction of the data, steps are much cheaper to compute, so the algorithm makes progress much faster.

Mini-Batch Gradient Descent

An epoch is one complete pass of gradient descent through the entire training set.

Epoch in Gradient Descent

Suppose you have a huge training set with 5,000,000 training samples:
$X=[ x^{(1)}, x^{(2)}, \ldots, x^{(5000000)}]$
Let's say each of your baby training sets (mini-batches) has just 1,000 examples. You take the first mini-batch as $X^{\{1\}} = [x^{(1)}, x^{(2)}, \ldots, x^{(1000)}]$, then the next 1,000 samples as $X^{\{2\}} = [x^{(1001)}, \ldots, x^{(2000)}]$, and so on.

Altogether you would have 5,000 of these mini-batches. You then split Y in the same way, so we end up with mini-batch pairs $X^{\{t\}}, Y^{\{t\}}$ for $t = 1, 2, \ldots, 5000$.
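The partitioning described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `make_mini_batches` is hypothetical, the data is a scaled-down stand-in (5,000 examples instead of 5,000,000), and examples are assumed to be stored as columns.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split column-wise examples into consecutive mini-batches.

    X has shape (n_features, m) and Y has shape (n_outputs, m), so each
    column is one training example, matching X = [x^(1), ..., x^(m)].
    """
    m = X.shape[1]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        # The same column range is taken from X and Y so pairs stay aligned.
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches

# Scaled-down stand-in data (assumption): 5,000 examples, 2 features.
X = np.arange(2 * 5000, dtype=float).reshape(2, 5000)
Y = np.arange(5000, dtype=float).reshape(1, 5000)

mini_batches = make_mini_batches(X, Y, batch_size=1000)
print(len(mini_batches))         # number of mini-batches
print(mini_batches[0][0].shape)  # shape of the first X mini-batch
```

In practice the columns are often shuffled with one shared random permutation before slicing, so that each mini-batch is a random sample of the data.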

An Example of Mini-Batches

First, we take a constant learning rate, represented by the blue line. As we iterate, the steps remain large and noisy, so the algorithm does not converge on the minimum. Instead, it wanders around it.

Next, we take a decaying learning rate, represented by the green line. At the start, the algorithm takes large steps with each iteration. The learning rate is then reduced, or decayed, as training approaches the minimum. With this lower learning rate, the algorithm takes smaller, tighter steps around the minimum and ends up closer to convergence.

This method allows us to have relatively fast learning during the initial phases with large steps, but also converge to a minimum during the final phases with slower learning rates and smaller steps.

Example Using Mini-Batch Gradient Descent (Learning Rate Decay)

If we choose a mini-batch size of 1, we get the algorithm called Stochastic Gradient Descent, or SGD.

In this case, every iteration takes a gradient descent step using just a single training example:
$w = w - \alpha \nabla_w J(x^{(i)}, y^{(i)}; w)$

The most important property of SGD is that computation time per step does not grow with the number of examples. This makes SGD very efficient with large training sets.
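As a rough sketch of the single-example update, the code below applies SGD to least-squares linear regression. The model, the per-example cost $J = \frac{1}{2}(x \cdot w - y)^2$, the function name, and the synthetic data are all assumptions chosen for illustration. Each update touches exactly one example, so its cost does not depend on the size of the training set.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=20, seed=0):
    """SGD for least-squares linear regression, one example per update.

    X has shape (m, n_features); y has shape (m,).
    Per-example cost: J = 0.5 * (x.w - y)^2, so grad_w J = (x.w - y) * x.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # visit examples in random order
            error = X[i] @ w - y[i]
            grad = error * X[i]            # gradient from a single example
            w = w - alpha * grad           # w = w - alpha * grad_w J
    return w

# Hypothetical noiseless data generated from w_true = [2, -3].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true

w_hat = sgd_linear_regression(X, y)
print(w_hat)  # should be close to w_true on this noiseless data
```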

The learning rate is a hyperparameter that must be adjusted. Unlike regular parameters of a model (weights like w and b), which are learned by the algorithm from the training set, hyperparameters are special parameters chosen by the algorithm designer that affect how the algorithm works.

Stochastic Gradient Descent Algorithm

The expression $$\frac{\partial L_{\theta_t}(\mathcal{D}_{\mathrm{mini}})}{\partial \theta_t}$$ represents the gradient of the loss function, $$L$$, with respect to the model parameters, $$\theta_t$$. This gradient is computed on a specific mini-batch of training samples, $$\mathcal{D}_{\mathrm{mini}}$$, and indicates the direction of the steepest increase in the loss for that batch.

Loss Gradient over a Mini-batch

Although increasing the minibatch size $$|\mathcal{B}_t|$$ reduces the variance of gradient estimates, this benefit exhibits diminishing returns. Beyond a certain point, the additional reduction in standard deviation becomes minimal relative to the linear increase in computational cost per iteration. Therefore, in practice, the minibatch size is chosen to be large enough to offer good computational efficiency and stable gradient estimates, while still fitting within the memory constraints of the hardware, such as a GPU.
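The diminishing returns can be made concrete under a simplifying assumption: the examples in a minibatch are drawn independently, and each per-example gradient component $$g_i$$ has standard deviation $$\sigma$$. The symbols $$B$$ (minibatch size), $$g_i$$, and $$\sigma$$ are introduced here only for illustration. The standard deviation of the averaged gradient is then

$$\operatorname{std}\left(\frac{1}{B}\sum_{i=1}^{B} g_i\right) = \frac{\sigma}{\sqrt{B}}$$

For example, growing the minibatch from $$B = 100$$ to $$B = 10{,}000$$ multiplies the computation per iteration by 100 but shrinks the standard deviation only by a factor of $$\sqrt{100} = 10$$.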

Minibatch Size Selection Trade-off

For $$t = 1, 2, \ldots, N$$ (where $$N$$ is the number of mini-batches):

1. Forward propagate on mini-batch $$X^{\{t\}}$$.
2. Compute the cost function $$J^{\{t\}}$$ for that mini-batch.
3. Backpropagate to compute gradients with respect to $$J^{\{t\}}$$, using $$X^{\{t\}}$$ and $$Y^{\{t\}}$$.
4. Update parameters: $$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$$, $$b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$$, where $$\alpha$$ is the learning rate and $$l$$ indexes each layer.

One complete pass through all $$N$$ mini-batches constitutes one epoch of training.
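The four steps above can be sketched for the simplest possible case, a single linear layer with a squared-error cost. This is an illustrative sketch under stated assumptions, not the general multi-layer algorithm: the function name `train_mini_batch`, the cost function, the per-epoch shuffling, and the synthetic data are all hypothetical choices.

```python
import numpy as np

def train_mini_batch(X, Y, batch_size=32, alpha=0.1, epochs=50, seed=0):
    """Mini-batch gradient descent for one linear layer Y_hat = W X + b.

    X has shape (n_x, m) and Y has shape (1, m); columns are examples.
    """
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W = np.zeros((1, n_x))
    b = 0.0
    costs = []
    for _ in range(epochs):              # one epoch = one pass over all mini-batches
        perm = rng.permutation(m)        # reshuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xt, Yt = X[:, idx], Y[:, idx]
            size = Xt.shape[1]
            Y_hat = W @ Xt + b                             # 1. forward propagate
            cost = np.sum((Y_hat - Yt) ** 2) / (2 * size)  # 2. cost for this mini-batch
            dZ = (Y_hat - Yt) / size                       # 3. backpropagate
            dW = dZ @ Xt.T
            db = np.sum(dZ)
            W = W - alpha * dW                             # 4. update parameters
            b = b - alpha * db
            costs.append(cost)
    return W, b, costs

# Hypothetical noiseless data from a known linear map.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 512))
W_true, b_true = np.array([[1.5, -2.0, 0.5]]), 0.7
Y = W_true @ X + b_true

W, b, costs = train_mini_batch(X, Y)
print(W, b, costs[0], costs[-1])
```

With 512 examples and a batch size of 32, each epoch performs 16 parameter updates, compared with a single update per epoch for batch gradient descent.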

Mini-Batch Gradient Descent Algorithm

$$\alpha = \frac{1}{1 + decay\_rate \cdot epoch\_num} \alpha_0$$, where $$\alpha$$ is the learning rate in the current epoch, $$\alpha_0$$ is the initial learning rate, $$epoch\_num$$ is the current epoch, and $$decay\_rate$$ is the selected decay rate. The decay rate is a tunable hyperparameter. Initializing $$decay\_rate = 1$$ and $$\alpha_0 = 0.2$$, we can graph an example with $$epoch\_num$$ on the x-axis and $$\alpha$$ on the y-axis to observe the decay of the learning rate.
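The decay formula is easy to tabulate. The short sketch below evaluates it for the example values $$decay\_rate = 1$$ and $$\alpha_0 = 0.2$$, assuming epochs are numbered starting from 1 (the text does not state the starting index); the function name is hypothetical.

```python
def decayed_learning_rate(alpha_0, decay_rate, epoch_num):
    """Return alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha_0 / (1.0 + decay_rate * epoch_num)

alpha_0, decay_rate = 0.2, 1.0

# Learning rates for epochs 1 through 4.
schedule = [decayed_learning_rate(alpha_0, decay_rate, e) for e in range(1, 5)]
for epoch_num, alpha in enumerate(schedule, start=1):
    print(f"epoch {epoch_num}: alpha = {alpha:.4f}")
```

Under these assumptions the learning rate falls from 0.1000 in epoch 1 to 0.0667, 0.0500, and 0.0400 in epochs 2 through 4, so the largest drops happen early and the schedule flattens out later.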
