1Cademy - Minibatch SGD Runtime Advantage Over Batch GD and SGD

Learn Before

Batch vs Stochastic vs Mini-Batch Gradient Descent

Example

Minibatch SGD Runtime Advantage Over Batch GD and SGD

When minibatch stochastic gradient descent is applied to the Airfoil Self-Noise dataset with a batch size of 100 and a learning rate of 0.4, each epoch takes approximately 0.025 seconds, comparable to full-batch gradient descent, while the loss is reduced substantially faster. Reducing the batch size to 10 (with a learning rate of 0.05) raises the epoch time to about 0.090 seconds, because small batches use the hardware less efficiently, yet this is still faster than pure SGD, which updates the parameters one example at a time. A final comparison plots loss against time, on a logarithmic time axis, for all four methods: full-batch GD, SGD, minibatch SGD with batch size 100, and minibatch SGD with batch size 10. It confirms that although SGD converges faster than GD in terms of examples processed, SGD needs more wall-clock time to reach the same loss, because computing gradients one example at a time is inefficient. Minibatch SGD achieves the best overall trade-off. A batch size of 100 can even outperform full-batch GD in total runtime: it makes 15 parameter updates per epoch instead of just one (which, at 100 examples per batch, implies about 1,500 training examples) while retaining the computational benefits of vectorized operations.
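The update-count side of this trade-off can be illustrated with a minimal sketch in plain Python. It is not the original experiment: it uses synthetic linear-regression data in place of the Airfoil Self-Noise dataset, and the helper names (`make_data`, `run_epoch`, `compare`) are hypothetical. The batch sizes 100 and 10 and their learning rates come from the text above; the learning rates for full-batch GD (0.4) and SGD (0.005) are assumptions chosen for illustration. Because it runs without vectorized hardware operations, the sketch does not attempt to reproduce the timing figures; it only shows how many parameter updates each method makes per epoch and the loss after one epoch.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def make_data(n=1500, d=5, seed=0):
    """Synthetic linear-regression data (a stand-in, NOT the Airfoil dataset)."""
    rng = random.Random(seed)
    true_w = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [dot(true_w, row) + rng.gauss(0.0, 0.01) for row in X]
    return X, y


def half_mse(w, X, y):
    """Squared loss averaged over all examples, with the usual 1/2 factor."""
    return sum((dot(w, row) - t) ** 2 for row, t in zip(X, y)) / (2 * len(y))


def run_epoch(w, X, y, batch_size, lr, rng):
    """One pass over the shuffled data; returns new weights and update count."""
    order = list(range(len(y)))
    rng.shuffle(order)
    updates = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        grad = [0.0] * len(w)
        for i in batch:
            err = dot(w, X[i]) - y[i]
            for j, xj in enumerate(X[i]):
                grad[j] += err * xj
        # Average the gradient over the batch, then take one step.
        w = [wj - lr * gj / len(batch) for wj, gj in zip(w, grad)]
        updates += 1
    return w, updates


# name -> (batch size, learning rate)
CONFIGS = {
    "full-batch GD": (1500, 0.4),   # learning rate is an assumption
    "minibatch 100": (100, 0.4),    # settings from the text
    "minibatch 10": (10, 0.05),     # settings from the text
    "SGD": (1, 0.005),              # learning rate is an assumption
}


def compare(n=1500, d=5):
    """Run one epoch per configuration from the same zero initialization."""
    X, y = make_data(n, d)
    results = {}
    for name, (batch_size, lr) in CONFIGS.items():
        w, updates = run_epoch([0.0] * d, X, y, batch_size, lr, random.Random(1))
        results[name] = (updates, half_mse(w, X, y))
    return results


results = compare()
for name, (updates, loss) in results.items():
    print(f"{name:>14}: {updates:4d} updates/epoch, loss after one epoch = {loss:.6f}")
```

With 1,500 examples, the four configurations make 1, 15, 150, and 1,500 updates per epoch respectively. At the same learning rate, the 15 smaller steps of batch size 100 move the parameters much further in one epoch than the single full-batch step, which is the mechanism behind the runtime advantage described above.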

Updated 2026-07-01

References

Dive into Deep Learning