SGD: Slower Wall-Clock Time Despite Faster Per-Example Convergence
When stochastic gradient descent is applied to the Airfoil Self-Noise dataset with a batch size of 1, the model parameters are updated once per example, so the number of updates per epoch equals the number of training examples. Measured in examples processed, the objective function value declines rapidly within the first epoch. Measured in wall-clock time, however, each epoch takes considerably longer than an epoch of batch gradient descent. Processing observations one at a time cannot exploit hardware vectorization and incurs a fixed overhead on every update, so each gradient computation is less efficient even though the parameters are updated far more often.
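The trade-off can be reproduced with a minimal sketch. It does not use the Airfoil Self-Noise data or the original training code: the synthetic regression data, the helper names `make_data`, `run_epoch`, and `mse`, the learning rate of 0.01, and the sizes are all assumptions chosen for illustration. It runs one epoch of linear regression with a batch size of 1 and one epoch with the full dataset as a single batch, then reports the number of updates, the final loss, and the elapsed time.

```python
import time

import numpy as np


def make_data(n=1000, d=5, seed=0):
    """Synthetic linear-regression data (an assumption, not the Airfoil dataset)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.01 * rng.normal(size=n)
    return X, y


def mse(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))


def run_epoch(X, y, w, lr, batch_size):
    """One pass over the data; returns the new weights and the update count."""
    updates = 0
    for start in range(0, X.shape[0], batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        # Gradient of 0.5 * mean squared error over the current batch.
        grad = Xb.T @ (Xb @ w - yb) / len(yb)
        w = w - lr * grad
        updates += 1
    return w, updates


X, y = make_data()
results = {}
for batch_size in (1, len(X)):  # per-example SGD vs. full-batch gradient descent
    w0 = np.zeros(X.shape[1])
    start_time = time.perf_counter()
    w, updates = run_epoch(X, y, w0, lr=0.01, batch_size=batch_size)
    elapsed = time.perf_counter() - start_time
    results[batch_size] = (updates, mse(X, y, w), elapsed)
    print(f"batch_size={batch_size}: updates={updates}, "
          f"loss={results[batch_size][1]:.4f}, time={elapsed:.4f}s")
```

With a batch size of 1 the loop performs one small matrix-vector product per example, so the Python and library call overhead is paid on every update. The full-batch run performs a single vectorized product for the whole epoch. Exact timings depend on the machine, so the sketch prints them rather than assuming a particular ratio.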
Dive into Deep Learning @ D2L