Concept
CPU-Based Gradient Aggregation
In multi-GPU training, gradients can be aggregated on the host CPU instead of on one designated GPU. However, a CPU typically has too few PCIe lanes to connect directly to every GPU, so data must travel through a multiplexer switch. This creates a communication bottleneck: each GPU must send its gradients to the CPU individually over the shared link, which incurs a significant bandwidth penalty and makes synchronization often much slower than direct GPU-to-GPU communication.
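The pattern described above can be sketched in plain Python. This is only an illustrative sketch, not how any particular framework implements it: the function names `aggregate_on_cpu` and `estimated_sync_seconds` are hypothetical, gradients are modeled as lists of floats, and the cost model assumes that all transfers are serialized over a single shared link, which is a simplification of real PCIe topologies.

```python
def aggregate_on_cpu(per_gpu_grads):
    """Sum the gradients from every device on the host, then return one copy per device."""
    num_params = len(per_gpu_grads[0])
    total = [0.0] * num_params
    # Each GPU's gradient is handled one after another, mimicking
    # individual transfers to the CPU over a shared link.
    for grads in per_gpu_grads:
        for i, g in enumerate(grads):
            total[i] += g
    # The summed gradient is then handed back to every GPU.
    return [list(total) for _ in per_gpu_grads]


def estimated_sync_seconds(num_gpus, grad_bytes, link_bytes_per_s):
    """Rough cost model (assumption): num_gpus uploads plus num_gpus
    downloads, all serialized over one shared link."""
    return 2 * num_gpus * grad_bytes / link_bytes_per_s


# Four simulated GPUs, each holding a two-parameter gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = aggregate_on_cpu(grads)
print(synced[0])  # [16.0, 20.0]
```

Under this simplified model the synchronization time grows linearly with the number of GPUs, which is one way to see why routing every gradient through the CPU scales poorly. Any bandwidth figure passed to `estimated_sync_seconds` is a placeholder to be replaced with measurements from the actual hardware.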
Updated 2026-05-18
Tags
D2L
Dive into Deep Learning @ D2L