Distributed Gradient Aggregation

To use interconnect bandwidth efficiently in multi-GPU training, distributed gradient aggregation splits the model's gradients into partitions and assigns each partition to a different GPU, so that all partitions are aggregated at the same time. Because modern PCIe switches can run all of their connected links at full bandwidth concurrently, no single link becomes the bottleneck, and parameter synchronization takes considerably less time than sending every gradient to one central GPU or CPU.
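The idea can be illustrated with a minimal, pure-Python sketch that simulates the GPUs as plain lists. The function names (`partition`, `centralized_aggregate`, `distributed_aggregate`, `sync_time_ms`) are hypothetical helpers introduced here for illustration, not part of any library. The cost model is a deliberate simplification: it assumes every link has the same bandwidth, all links run concurrently, latency is ignored, and 1 GB = 1000 MB. The sizes used (160 MB of gradients, 16 GB/s links, 4 GPUs) are assumed example values.

```python
def partition(grad, num_parts):
    """Split a flat gradient list into num_parts contiguous chunks."""
    size, rem = divmod(len(grad), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < rem else 0)
        chunks.append(grad[start:end])
        start = end
    return chunks


def centralized_aggregate(grads):
    """Baseline: one device sums the full gradient from every worker."""
    return [sum(vals) for vals in zip(*grads)]


def distributed_aggregate(grads):
    """Simulated GPU k sums only partition k; the results are then gathered."""
    n = len(grads)
    parts = [partition(g, n) for g in grads]  # parts[worker][k]
    reduced = []
    for k in range(n):  # on real hardware these n reductions run concurrently
        chunk_k = [parts[w][k] for w in range(n)]
        reduced.append([sum(vals) for vals in zip(*chunk_k)])
    # Gather step: concatenate the reduced partitions into the full gradient.
    return [x for chunk in reduced for x in chunk]


def sync_time_ms(size_mb, bw_gb_per_s, n, distributed):
    """Simplified transfer-time estimate (assumptions listed in the lead-in).

    Centralized: the central device receives n-1 full gradients over its
    link, then sends n-1 full copies back.
    Distributed: each device receives n-1 chunks of size/n, then sends its
    reduced chunk to the n-1 others, with all devices working concurrently.
    """
    per_transfer_mb = size_mb / n if distributed else size_mb
    # MB / (GB/s) gives milliseconds when 1 GB = 1000 MB.
    return 2 * (n - 1) * per_transfer_mb / bw_gb_per_s


# Four simulated GPUs, each holding a 10-element gradient.
grads = [[float(w + i) for i in range(10)] for w in range(4)]
assert distributed_aggregate(grads) == centralized_aggregate(grads)

central_ms = sync_time_ms(160, 16, 4, distributed=False)
distrib_ms = sync_time_ms(160, 16, 4, distributed=True)
print(central_ms, distrib_ms)
```

Under these assumptions the partitioned scheme moves the same total data but spreads it across all links, so the estimated time shrinks by a factor equal to the number of GPUs. Real systems typically delegate this to a collective-communication library rather than hand-written loops.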

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L