CPU-Based Gradient Aggregation

In multi-GPU training, gradients can be aggregated on the host CPU rather than on one designated GPU. However, a CPU typically does not expose enough PCIe lanes to connect directly to every GPU, so traffic is routed through a multiplexer (PCIe) switch and the GPUs share the uplink to the CPU. This makes the CPU link a communication bottleneck: every GPU, including the one that would otherwise have served as the aggregator, must send its gradients to the CPU and then receive the result back. These extra transfers over a shared link usually make synchronization slower than aggregating directly on a GPU.
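
The bandwidth penalty can be made concrete with a rough cost model. The sketch below is a back-of-the-envelope estimate rather than a measurement. It assumes that transfers over the shared link happen one after another, that sending results back costs as much as collecting gradients, and that 1 GB = 1000 MB. The gradient size (160 MB), link bandwidth (16 GB/s), and GPU count (4) are illustrative assumptions, not figures from the text above, and the function names are hypothetical.

```python
def transfer_ms(size_mb: float, bandwidth_gb_s: float) -> float:
    """Milliseconds to move size_mb megabytes over one link.

    With 1 GB = 1000 MB, MB divided by GB/s comes out in milliseconds.
    """
    return size_mb / bandwidth_gb_s


def gpu_aggregation_ms(num_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """Estimated sync time when one GPU aggregates the gradients.

    The aggregating GPU already holds its own gradient, so only
    num_gpus - 1 transfers are needed in each direction.
    """
    one_way = (num_gpus - 1) * transfer_ms(grad_mb, bandwidth_gb_s)
    return 2 * one_way  # gather gradients, then send results back


def cpu_aggregation_ms(num_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """Estimated sync time when the CPU aggregates the gradients.

    Every GPU must transfer its gradient, so num_gpus transfers are
    needed in each direction (assumed serialized on the shared link).
    """
    one_way = num_gpus * transfer_ms(grad_mb, bandwidth_gb_s)
    return 2 * one_way


if __name__ == "__main__":
    # Illustrative assumptions: 4 GPUs, 160 MB of gradients, 16 GB/s per link.
    print(gpu_aggregation_ms(4, 160, 16))  # 60.0
    print(cpu_aggregation_ms(4, 160, 16))  # 80.0
```

Under these assumptions, CPU aggregation costs one extra transfer in each direction (80 ms versus 60 ms). Real systems may overlap transfers or have different link speeds, so the model should be read as showing where the penalty comes from, not how large it is on a given machine.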

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L