Comparison

Batch vs Stochastic vs Mini-Batch Gradient Descent

Batch gradient descent (batch size = N, where N is the number of training examples) produces low-noise gradient estimates and takes large, reliable steps toward the minimum. However, each iteration can be slow and memory-intensive, because the gradient is computed over the entire training set.

Stochastic gradient descent (batch size = 1) is memory-efficient and well-suited for large datasets. However, it is extremely noisy because individual training examples may point in poor directions. SGD tends to oscillate and wander around the region of the minimum rather than converging directly to it.
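To make the "poor directions" point concrete, here is a minimal NumPy sketch. It assumes a hypothetical noisy linear-regression problem (the data, `w_true`, and the evaluation point `w` are all made up for illustration) and measures how many single-example gradients point away from the full-batch gradient near the minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy linear-regression data: y = X @ w_true + noise
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(size=n)

# Evaluate gradients at a point close to the minimum, where label noise dominates.
w = w_true + 0.1

# Per-example gradients of the squared error (x_i @ w - y_i)**2, one row per example.
residual = X @ w - y
per_example_grads = 2.0 * residual[:, None] * X

# The full-batch gradient of the mean squared error is the average of those rows.
full_grad = per_example_grads.mean(axis=0)

# Cosine similarity between each single-example gradient and the full-batch gradient.
cosines = (per_example_grads @ full_grad) / (
    np.linalg.norm(per_example_grads, axis=1) * np.linalg.norm(full_grad)
)
frac_opposed = float(np.mean(cosines < 0))
print(f"fraction of examples pointing away from the full gradient: {frac_opposed:.2f}")
```

In this setup a sizeable share of individual examples disagree with the full-batch direction, even though their average is exactly the full-batch gradient. The exact fraction depends on the assumed noise level and on how close `w` is to the minimum.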

Minibatch gradient descent (batch size between 1 and N) offers a practical compromise. Although it does not guarantee monotonic progress toward the minimum, it tends to head more consistently in the right direction.
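The three variants differ only in the batch size, which a single training loop can express. The following is a rough sketch rather than a reference implementation: the function name `gradient_descent`, the linear model, the mean-squared-error loss, the learning rates, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit a linear model by minimizing mean squared error.

    batch_size = len(X) gives batch GD, batch_size = 1 gives SGD,
    and anything in between gives minibatch GD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of the mean squared error over the current batch
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Synthetic data (hypothetical): y = X @ w_true + small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=256)

w_batch = gradient_descent(X, y, batch_size=len(X))    # batch GD: 1 update per epoch
w_mini = gradient_descent(X, y, batch_size=32)         # minibatch GD: 8 updates per epoch
w_sgd = gradient_descent(X, y, batch_size=1, lr=0.01)  # SGD: 256 updates per epoch
```

SGD is given a smaller learning rate here because its single-example steps are much noisier. That choice is a judgment call for this toy problem, not a general rule.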

Experimentally, SGD converges faster than batch GD in terms of the number of examples processed, but it consumes more wall-clock time, because computing the gradient one example at a time makes poor use of vectorized computation. Minibatch SGD balances convergence speed and computational efficiency: for instance, a batch size of 100 can even outperform full-batch GD in runtime.
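The per-example overhead can be seen even without running a full training loop. The sketch below, on assumed random data, computes the same mean-squared-error gradient twice: once by looping over examples and once with a single matrix-vector product. The helper names `grad_loop` and `grad_vectorized` are hypothetical, and the timings vary by machine, so treat this as an illustration rather than a benchmark.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def grad_loop(X, y, w):
    """Accumulate the mean-squared-error gradient one example at a time."""
    g = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        g += 2.0 * (x_i @ w - y_i) * x_i
    return g / len(y)

def grad_vectorized(X, y, w):
    """Compute the same gradient with matrix-vector products."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

t0 = time.perf_counter()
g_loop = grad_loop(X, y, w)
t1 = time.perf_counter()
g_vec = grad_vectorized(X, y, w)
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.4f}s, vectorized: {t2 - t1:.4f}s")
```

Both functions return the same gradient up to floating-point rounding; only the cost of obtaining it differs. Minibatches recover most of the vectorization benefit while still updating the parameters many times per epoch.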


Updated 2026-05-15

Tags

Data Science

D2L

Dive into Deep Learning @ D2L
