Minibatch SGD Runtime Advantage Over Batch GD and SGD
When minibatch stochastic gradient descent (SGD) is applied to the Airfoil Self-Noise dataset with the larger of the two batch sizes tested, its time per epoch is comparable to that of full-batch gradient descent (GD), yet it converges substantially faster in terms of loss reduction. Reducing the batch size (with its own learning rate) increases the time per epoch, because smaller batches execute less efficiently on hardware, but this is still faster than pure SGD. A final comparison of loss against time, plotted on a logarithmic time axis for all four methods (full-batch GD, SGD, and the two minibatch sizes), confirms that although SGD converges faster than GD in terms of examples processed, it needs more wall-clock time to reach the same loss, because computing gradients one example at a time cannot exploit vectorization. Minibatch SGD achieves the best overall trade-off: with a suitably chosen batch size it can even outperform full-batch GD in total runtime, since it makes multiple parameter updates per epoch instead of just one while retaining the computational benefits of vectorized operations.
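The trade-off described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original experiment: it uses synthetic linear-regression data in place of the Airfoil Self-Noise dataset, and the function name `train`, the batch sizes (128 and 16), the learning rates, and the epoch count are all hypothetical choices made for this sketch. It reports how many parameter updates each method makes per epoch, the final loss, and the measured time per epoch, which will vary by machine.

```python
import time
import numpy as np

def train(X, y, batch_size, lr, num_epochs=2, seed=0):
    """Minibatch SGD for linear regression with squared loss.

    batch_size == len(X) gives full-batch GD; batch_size == 1 gives pure SGD.
    Returns (updates_per_epoch, final_loss, seconds_per_epoch).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    updates = 0
    start = time.perf_counter()
    for _ in range(num_epochs):
        perm = rng.permutation(n)  # reshuffle the examples every epoch
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            err = X[idx] @ w + b - y[idx]          # residuals on this batch
            w -= lr * (X[idx].T @ err) / len(idx)  # vectorized gradient step
            b -= lr * err.mean()
            updates += 1
    elapsed = time.perf_counter() - start
    loss = 0.5 * np.mean((X @ w + b - y) ** 2)
    return updates // num_epochs, loss, elapsed / num_epochs

# Synthetic stand-in data (assumption: NOT the Airfoil Self-Noise dataset).
rng = np.random.default_rng(42)
n, d = 1024, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5, 1.5, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=n)
initial_loss = 0.5 * np.mean(y ** 2)

# Illustrative (batch_size, learning_rate) pairs; none come from the source text.
configs = {
    "full-batch GD": (n, 1.0),
    "SGD": (1, 0.005),
    "minibatch 128": (128, 0.4),
    "minibatch 16": (16, 0.05),
}

results = {name: train(X, y, bs, lr) for name, (bs, lr) in configs.items()}
for name, (upd, loss, sec) in results.items():
    print(f"{name:>14}: {upd:5d} updates/epoch, loss {loss:.4f}, {sec:.4f} s/epoch")
```

On most machines pure SGD should show the longest time per epoch, because its inner loop runs once per example, while the minibatch settings make several updates per epoch at a cost much closer to that of full-batch GD. Plotting loss against elapsed time on a logarithmic time axis would reproduce the kind of comparison the text describes.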
Dive into Deep Learning @ D2L