1Cademy - Minibatch Scaling in Data Parallelism

Learn Before

Data Parallelism

Concept

Minibatch Scaling in Data Parallelism

When training is distributed across $k$ GPUs using data parallelism, it is standard practice to scale the overall minibatch size by a factor of $k$. With a per-GPU minibatch of $b$ examples, the global minibatch becomes $k \cdot b$, so each GPU processes the same amount of data and performs the same computational workload as it would when training on a single GPU. This scaling considerably increases the effective minibatch size (a 16-fold increase on a 16-GPU server, for example), so the learning rate typically needs to be increased as well, often roughly in proportion to $k$, to maintain stable and efficient optimization.
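
The arithmetic above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed recipe: the function names `scale_for_data_parallel` and `split_minibatch` are hypothetical, the example values (256 examples per GPU, base learning rate 0.1) are made up, and the strictly linear learning-rate scaling is an assumption that usually needs tuning (and often a warmup period) in practice.

```python
def scale_for_data_parallel(per_gpu_batch_size, base_lr, num_gpus):
    """Return (global_batch_size, scaled_lr) for training on num_gpus GPUs.

    Assumption: the learning rate is scaled linearly with the number of GPUs.
    This is a common heuristic, not a guarantee of stable training.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Each GPU keeps its single-GPU workload, so the global batch grows k-fold.
    global_batch_size = per_gpu_batch_size * num_gpus
    # Hypothetical linear scaling of the learning rate.
    scaled_lr = base_lr * num_gpus
    return global_batch_size, scaled_lr


def split_minibatch(examples, num_gpus):
    """Split one global minibatch into num_gpus equally sized shards."""
    if len(examples) % num_gpus != 0:
        raise ValueError("global minibatch must divide evenly across GPUs")
    shard_size = len(examples) // num_gpus
    return [examples[i * shard_size:(i + 1) * shard_size]
            for i in range(num_gpus)]


# Illustrative values only: 256 examples per GPU, base learning rate 0.1.
global_batch, lr = scale_for_data_parallel(256, 0.1, num_gpus=16)
shards = split_minibatch(list(range(global_batch)), num_gpus=16)
print(global_batch, lr, len(shards), len(shards[0]))
```

Each of the 16 shards holds 256 examples, matching the single-GPU workload, while the optimizer effectively sees a global minibatch of 4096 examples per step.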

Updated 2026-06-13


References

Dive into Deep Learning
