When batch gradient descent is applied to the Airfoil Self-Noise dataset by setting the minibatch size equal to the total number of training examples ($$1{,}500$$), the model parameters are updated only once per epoch. With a learning rate of $$1$$ over $$10$$ epochs, the loss ends at approximately $$0.247$$, and each epoch takes about $$0.020$$ seconds. Progress is minimal, however: after roughly $$6$$ parameter updates, the loss curve plateaus and further improvement stalls. This illustrates the main limitation of full-batch gradient descent: each epoch provides only a single update, so fine convergence requires many epochs even though each individual epoch is fast to compute.
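The update count behind this limitation is simple arithmetic. The sketch below assumes one parameter update per batch; the helper name `total_updates` is illustrative and not taken from the source.

```python
import math


def total_updates(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Parameter updates performed when every batch triggers one update."""
    updates_per_epoch = math.ceil(num_examples / batch_size)
    return updates_per_epoch * num_epochs


# Full-batch setting from the text: 1,500 examples, batch size 1,500, 10 epochs.
print(total_updates(1500, 1500, 10))  # 10
```

Under this assumption, ten epochs of full-batch training give the optimizer only ten chances to adjust the parameters.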

Batch gradient descent uses the entire dataset of size $$N$$ as a single batch. It produces low-noise gradient estimates and takes large, reliable steps toward the minimum, but each iteration requires considerable time and memory. Stochastic gradient descent (SGD) uses a batch size of $$1$$. It is memory-efficient but extremely noisy: the gradient of an individual example may point in a poor direction, so SGD tends to oscillate rather than move directly toward the minimum. Minibatch gradient descent uses a batch size strictly between $$1$$ and $$N$$ and offers a practical compromise between convergence speed and computational efficiency. Although SGD converges faster than batch gradient descent in terms of examples processed, computing the gradient one example at a time is computationally inefficient. Minibatch gradient descent benefits from hardware optimizations such as vectorization, so intermediate batch sizes (e.g., $$100$$) often outperform both extremes in overall wall-clock runtime.
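All three variants can be expressed as one loop that differs only in the batch size. The following is a minimal sketch in pure Python, assuming a one-feature linear model, squared loss, and synthetic data; the function name `minibatch_gd`, the learning rates, and the data are illustrative choices, not the source's setup.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr, num_epochs, seed=0):
    """Fit y ~ w * x + b by gradient descent on the loss 0.5 * (w*x + b - y)^2.

    batch_size == len(xs) gives batch GD, batch_size == 1 gives SGD,
    and anything in between gives minibatch GD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    num_updates = 0
    for _ in range(num_epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            errors = [w * xs[i] + b - ys[i] for i in batch]
            # Average the per-example gradients over the current batch.
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            num_updates += 1
    return w, b, num_updates


# Synthetic data (an assumption for illustration): y = 2x - 1 plus small noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x - 1.0 + data_rng.gauss(0.0, 0.01) for x in xs]
n = len(xs)

for name, batch_size, lr, epochs in [
    ("batch GD", n, 0.5, 200),
    ("minibatch", 20, 0.1, 50),
    ("SGD", 1, 0.05, 20),
]:
    w, b, updates = minibatch_gd(xs, ys, batch_size, lr, epochs)
    print(f"{name:10s} updates={updates:5d} w={w:.3f} b={b:.3f}")
```

Smaller batches are paired with smaller learning rates here, because noisier gradient estimates tolerate smaller steps; the specific values were picked only so that this toy problem converges.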

Batch vs Stochastic vs Mini-Batch Gradient Descent

Dive into Deep Learning

Batch GD Slow Convergence on the Airfoil Dataset

When stochastic gradient descent is applied to the Airfoil Self-Noise dataset with a batch size of $$1$$ and a learning rate of $$0.005$$, the model parameters are updated $$1{,}500$$ times per epoch (once per example). Measured in examples processed, the objective function value declines rapidly within the first epoch, but the wall-clock time per epoch is approximately $$0.685$$ seconds, about $$34$$ times slower than batch gradient descent's $$0.020$$ seconds per epoch. Processing observations one at a time cannot exploit hardware vectorization and incurs overhead on every update, so each epoch costs far more time despite the more frequent parameter updates.
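The cost of per-example processing can be isolated with a small experiment. This sketch assumes NumPy is available and uses synthetic data with 1,500 rows and 5 features (sizes chosen for illustration). It computes the same averaged squared-loss gradient twice, once in a Python loop and once with a single matrix-vector product. It does not run SGD itself, which would also update the weights after each example.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 5))  # synthetic stand-in: 1,500 examples, 5 features
y = rng.normal(size=1500)
w = np.zeros(5)


def grad_per_example(X, y, w):
    """Average squared-loss gradient, accumulated one example at a time."""
    g = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        g += (x_i @ w - y_i) * x_i
    return g / len(y)


def grad_vectorized(X, y, w):
    """The same gradient computed with one matrix-vector product."""
    return X.T @ (X @ w - y) / len(y)


t0 = time.perf_counter()
g_loop = grad_per_example(X, y, w)
t1 = time.perf_counter()
g_vec = grad_vectorized(X, y, w)
t2 = time.perf_counter()

print("results match:", np.allclose(g_loop, g_vec))
print(f"per-example loop: {t1 - t0:.6f} s, vectorized: {t2 - t1:.6f} s")
```

The two timings depend on the machine and are not meant to reproduce the figures quoted above. The point is only that the same arithmetic can run at very different speeds depending on how it is batched.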

SGD Slower Wall-Clock Time Despite Faster Per-Example Convergence

When minibatch stochastic gradient descent is applied to the Airfoil Self-Noise dataset with a batch size of $$100$$ and a learning rate of $$0.4$$, each epoch takes approximately $$0.025$$ seconds, comparable to full-batch gradient descent, while the loss falls substantially faster. Reducing the batch size to $$10$$ (with learning rate $$0.05$$) increases the epoch time to about $$0.090$$ seconds because smaller batches execute less efficiently on hardware, yet this is still faster than pure SGD. A final comparison of loss against time, plotted on a logarithmic time axis for all four methods (full-batch GD, SGD, batch size $$100$$, and batch size $$10$$), confirms that although SGD converges faster than GD in terms of examples processed, it needs more wall-clock time to reach the same loss because per-example gradient computation is less efficient. Minibatch SGD achieves the best overall trade-off: a batch size of $$100$$ can even outperform full-batch GD in total runtime by making $$15$$ parameter updates per epoch instead of just one, while retaining the computational benefits of vectorized operations.
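The trade-off becomes easier to see when the quoted per-epoch times are divided by the number of updates per epoch. The sketch below only does arithmetic on the figures reported in this section. The resulting per-update times are derived values, not measurements, and the helper name `seconds_per_update` is illustrative.

```python
import math

NUM_EXAMPLES = 1500

# (label, batch size, seconds per epoch), copied from the figures quoted in the text.
runs = [
    ("full-batch GD", 1500, 0.020),
    ("minibatch 100", 100, 0.025),
    ("minibatch 10", 10, 0.090),
    ("SGD", 1, 0.685),
]


def seconds_per_update(batch_size, sec_per_epoch, num_examples=NUM_EXAMPLES):
    """Return (updates per epoch, average seconds per update)."""
    updates = math.ceil(num_examples / batch_size)
    return updates, sec_per_epoch / updates


for label, batch_size, sec in runs:
    updates, per_update = seconds_per_update(batch_size, sec)
    print(f"{label:14s} updates/epoch={updates:5d}  ms/update={1000 * per_update:.3f}")
```

Read this way, a batch size of $$100$$ gets $$15$$ updates for roughly $$25\%$$ more epoch time than the single full-batch update ($$0.025$$ versus $$0.020$$ seconds), which is why it makes faster progress per second.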
