Learn Before
Concept
Batch Size and Stability
Batch size controls the accuracy of the estimate of the error gradient when training neural networks. Training with smaller batches is noisier, although the model tends to stabilize towards the end of the run. Smaller batch sizes are used for two main reasons: 1) smaller batches are noisy, offering a regularizing effect and lower generalization error; 2) smaller batches make it easier to fit one batch worth of training data in memory (e.g. when using a GPU). The batch size is often set to a small value such as 32 and is not tuned by the practitioner.
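As a concrete illustration, here is a minimal sketch of where batch size enters a training loop, assuming PyTorch; the toy dataset, model, and learning rate are placeholder assumptions, and only the batch size of 32 comes from the text above.

```python
# Minimal sketch (assumed PyTorch setup, toy data): batch_size sets how many
# samples contribute to each gradient estimate.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data: 256 samples, 10 features (placeholder values).
X = torch.randn(256, 10)
y = torch.randn(256, 1)
dataset = TensorDataset(X, y)

# Smaller batch_size -> noisier (regularizing) gradient estimates and a
# smaller memory footprint per step; 32 is the small default noted above.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for xb, yb in loader:              # one update per batch of 32 samples
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()                # gradient estimated from this batch only
    optimizer.step()
```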
Updated 2021-10-30
Tags
Data Science