Example

SGD: Slower Wall-Clock Time Despite Faster Per-Example Convergence

When stochastic gradient descent is applied to the Airfoil Self-Noise dataset with a batch size of 1 and a learning rate of 0.005, the model parameters are updated 1,500 times per epoch (once per example). Measured by the number of examples processed, the objective function value declines rapidly within the first epoch. Measured by the clock, however, each epoch takes approximately 0.685 seconds, over 30 times slower than batch gradient descent's 0.020 seconds per epoch. Processing observations one at a time cannot exploit hardware vectorization and incurs overhead on every update, so the same amount of gradient computation takes far longer even though the parameters are updated more often.
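The contrast can be reproduced in miniature with the sketch below. It is an illustration under stated assumptions, not the original experiment: it uses synthetic data with 1,500 examples and 5 features as a stand-in for the Airfoil Self-Noise dataset, a plain linear model with squared loss, and NumPy instead of a deep learning framework. The helper names (`sgd_epoch`, `gd_epoch`, `timed`, `loss`) are hypothetical, and absolute timings will differ from the figures quoted above.

```python
import time
import numpy as np

# Synthetic stand-in for the dataset (assumption: 1,500 examples, 5 features).
rng = np.random.default_rng(0)
n, d = 1500, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.01 * rng.normal(size=n)


def loss(w):
    """Mean squared-error objective, halved for a clean gradient."""
    return 0.5 * np.mean((X @ w - y) ** 2)


def sgd_epoch(w, lr=0.005):
    """One epoch with batch size 1: n separate, non-vectorized updates."""
    updates = 0
    for i in range(n):
        err = X[i] @ w - y[i]          # residual of a single example
        w = w - lr * err * X[i]        # gradient step on that example only
        updates += 1
    return w, updates


def gd_epoch(w, lr=0.005):
    """One epoch of full-batch gradient descent: one vectorized update."""
    grad = X.T @ (X @ w - y) / n       # gradient averaged over all examples
    return w - lr * grad, 1


def timed(epoch_fn, w):
    """Run one epoch and return (new weights, update count, seconds)."""
    start = time.perf_counter()
    w_new, updates = epoch_fn(w)
    return w_new, updates, time.perf_counter() - start


w0 = np.zeros(d)
w_sgd, sgd_updates, sgd_time = timed(sgd_epoch, w0)
w_gd, gd_updates, gd_time = timed(gd_epoch, w0)

print(f"SGD: {sgd_updates} updates, loss {loss(w_sgd):.4f}, {sgd_time:.4f} s")
print(f"GD:  {gd_updates} update,  loss {loss(w_gd):.4f}, {gd_time:.4f} s")
```

On a typical machine the per-example loop should take noticeably longer than the single matrix operation, while ending the epoch at a much lower loss. That is the trade-off described above: more progress per example, less work done per second.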


Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L