
Data Parallelism Training Process

In a data parallelism setup, every training iteration follows the same sequence. First, a random minibatch of training data is split into k equal portions, one for each of the k available GPUs. Each GPU holds a complete copy of the model parameters and independently computes the loss and the local gradients on its assigned portion. The local gradients from all k GPUs are then aggregated (via an allreduce operation) to obtain the overall minibatch stochastic gradient. The allreduce leaves this aggregated gradient on every GPU, and each GPU uses it to update its own complete set of model parameters. Because every GPU starts from the same parameters and applies the same gradient, the replicas stay synchronized across the system.
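The sequence above can be sketched as a small simulation. This is a minimal illustration in plain Python, not a real multi-GPU implementation: the k "GPUs" are just list entries, the model is an assumed linear regression with squared-error loss, and the names `local_gradient`, `allreduce_sum`, and `data_parallel_step` are hypothetical helpers introduced here for illustration. A real system would typically rely on a communication library for the allreduce.

```python
import random

def local_gradient(w, shard):
    """Summed squared-error gradient of a linear model over one data shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj
    return grad

def allreduce_sum(grads):
    """Simulated allreduce: every worker receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]

def data_parallel_step(replicas, minibatch, lr):
    """One iteration: split, compute local gradients, allreduce, update."""
    k = len(replicas)
    assert len(minibatch) % k == 0, "minibatch must split into k equal shards"
    size = len(minibatch) // k
    # 1. Split the minibatch into k equal portions, one per simulated GPU.
    shards = [minibatch[i * size:(i + 1) * size] for i in range(k)]
    # 2. Each worker computes gradients on its own shard with its own replica.
    local = [local_gradient(w, s) for w, s in zip(replicas, shards)]
    # 3. Aggregate; afterwards every worker holds the same summed gradient.
    reduced = allreduce_sum(local)
    # 4. Each worker updates its full parameter copy with the averaged gradient.
    n = len(minibatch)
    return [[wi - lr * gi / n for wi, gi in zip(w, g)]
            for w, g in zip(replicas, reduced)]

# Synthetic, noise-free data for an assumed target of y = 2*x0 - 1*x1.
random.seed(0)
true_w = [2.0, -1.0]
data = []
for _ in range(64):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

k = 4
replicas = [[0.0, 0.0] for _ in range(k)]  # identical initial parameters
for _ in range(200):
    minibatch = random.sample(data, 16)    # random minibatch, 4 examples per worker
    replicas = data_parallel_step(replicas, minibatch, lr=0.1)
```

Since each replica starts from the same values and applies the same reduced gradient, all k parameter copies remain identical after every step, and the summed shard gradients match what a single device would compute on the whole minibatch (up to floating-point rounding).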

Updated 2026-05-24

Tags

D2L

Dive into Deep Learning @ D2L
