Concept
Distributed Gradient Aggregation
To maximize bandwidth efficiency in multi-GPU training, distributed gradient aggregation splits the model's gradients into partitions and assigns each partition to a different GPU. Each GPU sums its assigned partition's contributions from all GPUs while the other partitions are aggregated in parallel. Because a PCIe switch can operate all of its connected links at full bandwidth concurrently, the transfers do not contend for a single link, so parameter synchronization takes considerably less time than sending every gradient to one central GPU or CPU.
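The partitioning idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the "devices" are plain Python lists rather than real GPUs, and the helper names `partition` and `distributed_aggregate` are hypothetical, not part of any library. It shows which device sums which slice, not the actual communication.

```python
def partition(vector, num_parts):
    """Split a flat list into num_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(vector), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(vector[start:end])
        start = end
    return chunks


def distributed_aggregate(grads_per_device):
    """Simulate partitioned aggregation (hypothetical helper).

    grads_per_device[d] is the flat gradient held by simulated device d.
    Returns the fully summed gradient as seen by every device.
    """
    num_devices = len(grads_per_device)
    # Every device splits its gradient into one chunk per device.
    chunks = [partition(g, num_devices) for g in grads_per_device]
    # Device `owner` sums chunk `owner` across all devices. On real hardware
    # these sums would run concurrently over separate links.
    reduced = [
        [sum(values) for values in zip(*(chunks[d][owner] for d in range(num_devices)))]
        for owner in range(num_devices)
    ]
    # Each device then collects the reduced chunks from all owners.
    full = [x for chunk in reduced for x in chunk]
    return [list(full) for _ in range(num_devices)]


# Two simulated devices, each holding a five-element gradient.
grads = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
result = distributed_aggregate(grads)
```

In this simulation each device is responsible for only about 1/k of the gradient when there are k devices, which is why the per-device traffic shrinks relative to a single central aggregator.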
Updated 2026-05-18
Tags
D2L
Dive into Deep Learning @ D2L