Example

Minibatch SGD Runtime Advantage Over Batch GD and SGD

When minibatch stochastic gradient descent is applied to the Airfoil Self-Noise dataset with a batch size of 100 and a learning rate of 0.4, each epoch takes approximately 0.025 seconds, comparable to full-batch gradient descent, while the loss falls substantially faster. Reducing the batch size to 10 (with a learning rate of 0.05) increases the epoch time to about 0.090 seconds because smaller batches are less efficient to execute on hardware, yet this is still faster than pure SGD. A final comparison of loss against time, plotted on a logarithmic time axis for all four methods (full-batch GD, SGD, batch size 100, and batch size 10), confirms that although SGD converges faster than GD in terms of examples processed, it needs more wall-clock time to reach the same loss because computing gradients one example at a time is less efficient. Minibatch SGD achieves the best overall trade-off: with a batch size of 100 it can even outperform full-batch GD in total runtime, because it makes 15 parameter updates per epoch instead of just one while retaining the computational benefits of vectorized operations.
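The update counts behind this comparison can be illustrated with a minimal sketch. It makes several assumptions: it uses synthetic linear-regression data in place of the Airfoil Self-Noise dataset, 1,500 training examples (the size that 15 updates per epoch at batch size 100 implies), and learning rates of 1.0 for full-batch GD and 0.005 for SGD that are chosen for illustration and not taken from the text above. The names `make_data`, `half_mse`, and `train` are hypothetical helpers. Because the sketch is plain Python with no vectorization, it does not reproduce the per-epoch timings quoted above; it only shows how batch size determines the number of parameter updates per epoch.

```python
import random

def make_data(n=1500, seed=0):
    """Synthetic standardized regression data (a stand-in, not the Airfoil dataset)."""
    rng = random.Random(seed)
    true_w = [0.5, -1.0, 0.25, 2.0, -0.75]  # assumed ground-truth weights
    X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(n)]
    y = [sum(wj * xj for wj, xj in zip(true_w, x)) + rng.gauss(0, 0.1) for x in X]
    return X, y

def half_mse(w, b, X, y):
    """Mean of 0.5 * squared error over the whole dataset."""
    total = 0.0
    for x, t in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, x)) + b - t
        total += 0.5 * err * err
    return total / len(y)

def train(X, y, batch_size, lr, epochs=2, seed=0):
    """Minibatch SGD for linear regression.

    Returns (final_loss, updates_per_epoch). batch_size == len(X) gives
    full-batch GD, and batch_size == 1 gives plain SGD.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    idx = list(range(n))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(idx)  # visit examples in a new random order each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:
                err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(d):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            m = len(batch)
            # One parameter update per minibatch, using the averaged gradient.
            w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b / m
            updates += 1
    return half_mse(w, b, X, y), updates // epochs

X, y = make_data()
initial_loss = half_mse([0.0] * 5, 0.0, X, y)

# (name, batch size, learning rate); the GD and SGD rates are assumptions.
configs = [("gd", 1500, 1.0), ("sgd", 1, 0.005),
           ("batch 100", 100, 0.4), ("batch 10", 10, 0.05)]
results = {}
for name, batch_size, lr in configs:
    results[name] = train(X, y, batch_size, lr)
    loss, per_epoch = results[name]
    print(f"{name}: {per_epoch} updates per epoch, final loss {loss:.4f}")
```

In a vectorized framework each of those updates would be a single batched matrix operation, which is where the runtime advantage of moderate batch sizes comes from.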


Updated 2026-05-15


Tags

D2L

Dive into Deep Learning @ D2L