Minibatch SGD Runtime Advantage Over Batch GD and SGD
When minibatch stochastic gradient descent is applied to the Airfoil Self-Noise dataset with the larger of two minibatch sizes and a learning rate of 0.4, it requires approximately 0.025 seconds per epoch, comparable to full-batch gradient descent, while reducing the loss substantially faster. Reducing the batch size (with learning rate 0.05) increases the epoch time to about 0.090 seconds because smaller batches make less efficient use of vectorized hardware, yet this is still faster than pure SGD. A final time-versus-loss comparison, plotted on a logarithmic time axis across all four methods (full-batch GD, SGD, and the two minibatch sizes), confirms that although SGD converges faster than GD in terms of examples processed, it needs more wall-clock time to reach the same loss because computing gradients one example at a time is less efficient. Minibatch SGD achieves the best overall trade-off: the larger minibatch size can even outperform full-batch GD in total runtime by making multiple parameter updates per epoch instead of just one, while retaining the computational benefits of vectorized operations.
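The trade-off described above can be reproduced in miniature. The following is a minimal sketch, not the original experiment: it assumes synthetic linear-regression data in place of the Airfoil Self-Noise dataset, and the helper `train_minibatch_sgd`, the dataset size, the batch sizes, and the learning rates for full-batch GD and pure SGD are all hypothetical choices made for illustration. Only the learning rates 0.4 and 0.05 come from the text.

```python
import time
import numpy as np

def train_minibatch_sgd(X, y, batch_size, lr, num_epochs=2, seed=0):
    """Linear regression with squared loss, trained by minibatch SGD.

    Hypothetical helper for illustration.
    Returns (updates_per_epoch, seconds_per_epoch, final_loss).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    updates_per_epoch = 0
    start = time.perf_counter()
    for _ in range(num_epochs):
        perm = rng.permutation(n)  # reshuffle the examples every epoch
        updates_per_epoch = 0
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            err = X[idx] @ w + b - y[idx]          # residuals on this minibatch
            w -= lr * (X[idx].T @ err) / len(idx)  # vectorized gradient step
            b -= lr * err.mean()
            updates_per_epoch += 1
    seconds_per_epoch = (time.perf_counter() - start) / num_epochs
    final_loss = 0.5 * np.mean((X @ w + b - y) ** 2)
    return updates_per_epoch, seconds_per_epoch, final_loss

# Synthetic stand-in data (assumption: not the Airfoil Self-Noise dataset).
rng = np.random.default_rng(42)
n, d = 1500, 5
X = rng.standard_normal((n, d))
true_w = rng.standard_normal(d)
y = X @ true_w + 0.1 * rng.standard_normal(n)

# (batch_size, learning_rate) pairs: full-batch GD, two minibatch sizes, pure SGD.
# All batch sizes, and the learning rates 1.0 and 0.005, are illustrative assumptions.
configs = [(n, 1.0), (100, 0.4), (10, 0.05), (1, 0.005)]
results = {}
for batch_size, lr in configs:
    results[batch_size] = train_minibatch_sgd(X, y, batch_size, lr)
    updates, secs, loss = results[batch_size]
    print(f"batch_size={batch_size:5d}  updates/epoch={updates:5d}  "
          f"sec/epoch={secs:.4f}  loss={loss:.4f}")
```

Measured times depend on the hardware and on NumPy's backend, so the absolute numbers will differ from those quoted above. The pattern to look for is that the update count per epoch grows as the batch size shrinks, while the time per epoch grows with it because each small update makes less use of vectorized arithmetic.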
Dive into Deep Learning @ D2L