Learn Before
Communication Overhead Examples in Data-Parallel Training
The time required to synchronize parameters during data-parallel training varies significantly with the aggregation strategy, primarily because of hardware bandwidth constraints. For instance, on a 4-way GPU server with a 16 GB/s connection, synchronizing 160 MB of gradients by aggregating on a single GPU takes 60 ms: each 160 MB transfer takes 10 ms, so the other three GPUs need 30 ms to send their gradients to the aggregating GPU and another 30 ms is needed to broadcast the updated weights back. If the same aggregation is performed on the CPU, the overhead rises to 80 ms, because all four GPUs must send their gradients to the CPU and receive the result (40 ms in each direction). However, if the gradients are partitioned into four 40 MB segments and each segment is aggregated on a different GPU simultaneously, the full-bandwidth links of the PCIe switch reduce the synchronization time to just 15 ms (7.5 ms in each direction). The same operation can therefore take anywhere from 15 ms to 80 ms, which is why distributed gradient aggregation is preferred over single-GPU or CPU-based aggregation in practice.
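The arithmetic behind this comparison can be checked with a short back-of-the-envelope model. The sketch below is illustrative only: the function names are hypothetical, the input figures (4 GPUs, 160 MB of gradients, 16 GB/s per link) are assumed from the D2L example, and the cost model assumes that transfers into or out of a single aggregation point are serialized, that 1 GB = 1000 MB, and that latency and compute time are negligible.

```python
def transfer_ms(size_mb: float, bandwidth_gb_s: float) -> float:
    """Time in ms to move size_mb over one link (assumes 1 GB = 1000 MB)."""
    # MB / (GB/s) = 1e-3 s = 1 ms, so the ratio is already in milliseconds.
    return size_mb / bandwidth_gb_s


def single_gpu_sync_ms(num_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """Aggregate on one GPU: the other GPUs send in turn, then receive in turn."""
    per_transfer = transfer_ms(grad_mb, bandwidth_gb_s)
    gather = (num_gpus - 1) * per_transfer      # gradients to the aggregating GPU
    broadcast = (num_gpus - 1) * per_transfer   # updated weights back out
    return gather + broadcast


def cpu_sync_ms(num_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """Aggregate on the CPU: every GPU must both send and receive."""
    per_transfer = transfer_ms(grad_mb, bandwidth_gb_s)
    return 2 * num_gpus * per_transfer


def partitioned_sync_ms(num_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """Split the gradient into num_gpus parts, each aggregated on its own GPU.

    All parts are processed in parallel (assumes full-bandwidth links between
    all GPUs), so the cost is that of one part's gather plus its broadcast.
    """
    per_transfer = transfer_ms(grad_mb / num_gpus, bandwidth_gb_s)
    return 2 * (num_gpus - 1) * per_transfer


if __name__ == "__main__":
    gpus, grad_mb, bw = 4, 160.0, 16.0  # assumed figures, see lead-in
    print("single GPU :", single_gpu_sync_ms(gpus, grad_mb, bw), "ms")
    print("CPU        :", cpu_sync_ms(gpus, grad_mb, bw), "ms")
    print("partitioned:", partitioned_sync_ms(gpus, grad_mb, bw), "ms")
```

Under these assumptions the three strategies come out at 60 ms, 80 ms, and 15 ms respectively. Real systems will deviate from this model, since link contention, latency, and overlap of communication with computation are all ignored here.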
Tags
D2L
Dive into Deep Learning @ D2L