Concept

Minibatch Scaling in Data Parallelism

When distributing training across $k$ GPUs using data parallelism, it is standard practice to scale the overall minibatch size by a factor of $k$. This ensures that each GPU processes the same amount of data, and so performs the same computational workload, as it would when training on a single GPU. Because this scaling considerably increases the effective minibatch size (a 16-fold increase on a 16-GPU server, for example), the learning rate must typically be increased as well, often in proportion to $k$, to maintain stable and efficient optimization.
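The arithmetic behind this scaling can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed recipe: the helper name `scale_for_data_parallel`, the per-GPU minibatch size of 256, and the base learning rate of 0.1 are assumptions chosen for the example, and the strictly proportional (linear) learning-rate rule is one common heuristic that may need tuning or a warmup phase in practice.

```python
def scale_for_data_parallel(per_gpu_batch_size, base_lr, num_gpus):
    """Return (global_batch_size, scaled_lr) for data-parallel training.

    Hypothetical helper for illustration only. It assumes the linear
    scaling heuristic: the learning rate grows in proportion to the
    number of GPUs, matching the growth of the overall minibatch.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Each GPU keeps its single-GPU workload, so the overall
    # minibatch grows by a factor of num_gpus.
    global_batch_size = per_gpu_batch_size * num_gpus
    # Assumed linear rule: scale the learning rate by the same factor.
    scaled_lr = base_lr * num_gpus
    return global_batch_size, scaled_lr


# Example values (assumed): 256 examples per GPU, base learning rate 0.1.
global_bs, lr = scale_for_data_parallel(256, 0.1, num_gpus=16)
print(f"global minibatch size: {global_bs}, learning rate: {lr:.2f}")
# Each GPU still sees global_bs // 16 examples per step.
per_gpu = global_bs // 16
```

With these assumed values, the overall minibatch is 16 times larger than on a single GPU, while each GPU still handles 256 examples per step.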

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L