Concept

Single-GPU Gradient Aggregation

In data-parallel training, one straightforward parameter synchronization strategy is to aggregate all gradients on a single primary GPU. Each device first computes the loss and gradients on its own portion of the data batch, then transfers its gradients to the primary GPU. The primary GPU aggregates (sums) the gradients, performs the parameter update, and broadcasts the updated parameters back to all other GPUs before the next training iteration.
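The four steps above can be sketched in plain Python. This is a minimal simulation under stated assumptions, not a real multi-GPU implementation: ordinary Python lists stand in for device memory, the model is a hypothetical one-parameter linear fit y = w * x with squared-error loss, and the names `local_gradient` and `train_step` are invented for illustration.

```python
# Simulated "devices": each entry in `replicas` is one device's copy of the
# parameter w. No real GPUs are used (assumption made for illustration).

def local_gradient(w, shard):
    """Sum over one shard of d/dw [0.5 * (w * x - y)**2] = (w * x - y) * x."""
    return sum((w * x - y) * x for x, y in shard)


def train_step(replicas, shards, lr, primary=0):
    """One data-parallel step with aggregation on a single primary device."""
    # 1. Every device computes gradients on its own shard with its own replica.
    grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
    # 2. All gradients are transferred to the primary device and summed there.
    total_grad = sum(grads)
    batch_size = sum(len(shard) for shard in shards)
    # 3. Only the primary device performs the parameter update.
    new_w = replicas[primary] - lr * total_grad / batch_size
    # 4. The updated parameter is broadcast back to every device.
    return [new_w for _ in replicas]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2 * x
shards = [data[:2], data[2:]]   # one shard per simulated device
replicas = [0.0, 0.0]           # identical initial parameters on both devices

for _ in range(50):
    replicas = train_step(replicas, shards, lr=0.05)
```

Because the summed gradient is divided by the total batch size, each step matches what a single device would compute on the full batch. All replicas hold the same value after every broadcast, and in this toy setup they approach w = 2.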

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L