Concept
Single-GPU Gradient Aggregation
In data-parallel training, one straightforward parameter synchronization strategy is to aggregate all computed gradients on a single primary GPU. Each GPU holds a full copy of the parameters and computes the loss and gradients on its own portion of the minibatch, then transfers its gradients to the primary GPU. The primary GPU aggregates the gradients (for example, by summing them), performs the parameter update, and broadcasts the updated parameters back to all other GPUs so that every device starts the next training iteration with identical parameters.
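The steps above can be sketched in plain Python. This is a minimal simulation rather than real multi-GPU code: the "GPUs" are just list entries, the model is a single scalar weight fit with squared error, and the names `local_gradient`, `train_step`, `replicas`, and `shards` are hypothetical, chosen only for illustration.

```python
# Simulated "GPUs" are plain Python list entries; no real devices are involved.
# Model (an assumption for illustration): y ≈ w * x with loss 0.5 * (w*x - y)^2.

def local_gradient(w, shard):
    """Gradient of the summed squared error over one device's shard."""
    return sum((w * x - y) * x for x, y in shard)

def train_step(replicas, shards, lr, primary=0):
    """One data-parallel step with all gradients aggregated on `primary`."""
    # 1. Each device computes a gradient on its own shard with its own copy of w.
    grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
    # 2. All gradients are "transferred" to the primary device and summed there.
    total = sum(grads)
    n = sum(len(shard) for shard in shards)
    # 3. The primary device performs the update (averaged over all examples).
    new_w = replicas[primary] - lr * total / n
    # 4. The updated parameter is broadcast back to every device.
    return [new_w for _ in replicas]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]   # two simulated devices, half the batch each
replicas = [0.0, 0.0]           # identical initial parameter copies

for _ in range(50):
    replicas = train_step(replicas, shards, lr=0.05)
```

Because the per-device gradients are summed before the update, one step over two shards gives the same result as one step over the whole batch on a single device. In this sketch, all communication passes through the primary device.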
Updated 2026-05-18
Tags
D2L
Dive into Deep Learning @ D2L