1Cademy - Communication Overhead Examples in Data-Parallel Training

The time required to synchronize parameters during data-parallel training depends heavily on the aggregation strategy, because each strategy uses the available interconnect bandwidth differently. Consider a 4-way GPU server with 16 GB/s links and a 160 MB gradient, so that moving the full gradient over one link takes 160 MB / 16 GB/s = 10 ms. If all gradients are aggregated on a single GPU, synchronization takes 60 ms: 30 ms for the other three GPUs to send their gradients to it (three 10 ms transfers), and another 30 ms to send the updated weights back. If the aggregation is performed on the CPU instead, the cost rises to 80 ms, because all four GPUs must send their gradients to the CPU (40 ms) and then receive the updated weights from it (40 ms). However, if the gradient is partitioned into four 40 MB segments and each GPU aggregates one segment, the transfers can run simultaneously, since the PCIe switch offers full bandwidth between all links. Each phase then takes 7.5 ms instead of 30 ms, for a total of 15 ms. In this example, distributed gradient aggregation is 4 times faster than single-GPU aggregation and more than 5 times faster than CPU-based aggregation.
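The arithmetic behind these figures can be reproduced with a small cost model. The following is only a sketch under simplifying assumptions: transfers into or out of a single aggregator are serialized on its link, 1 GB is treated as 1000 MB, and latency and compute time are ignored. The function names are hypothetical and chosen for illustration only.

```python
def transfer_ms(size_mb: float, bandwidth_gb_s: float) -> float:
    """Time in ms to move size_mb over one link (assumes 1 GB = 1000 MB)."""
    # MB / (GB/s) = 1e-3 s, which is exactly one millisecond per unit.
    return size_mb / bandwidth_gb_s


def single_gpu_sync_ms(n_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """One GPU aggregates: the other n-1 GPUs send, then receive, in turn."""
    one_way = (n_gpus - 1) * transfer_ms(grad_mb, bandwidth_gb_s)
    return 2 * one_way


def cpu_sync_ms(n_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """CPU aggregates: all n GPUs send, then receive, in turn."""
    one_way = n_gpus * transfer_ms(grad_mb, bandwidth_gb_s)
    return 2 * one_way


def partitioned_sync_ms(n_gpus: int, grad_mb: float, bandwidth_gb_s: float) -> float:
    """Each GPU aggregates one 1/n slice; all GPUs work simultaneously."""
    slice_mb = grad_mb / n_gpus
    # Each GPU receives n-1 slices, then sends its aggregated slice n-1 times.
    one_way = (n_gpus - 1) * transfer_ms(slice_mb, bandwidth_gb_s)
    return 2 * one_way


if __name__ == "__main__":
    args = (4, 160.0, 16.0)  # 4 GPUs, 160 MB gradient, 16 GB/s links
    print("single GPU :", single_gpu_sync_ms(*args), "ms")
    print("CPU        :", cpu_sync_ms(*args), "ms")
    print("partitioned:", partitioned_sync_ms(*args), "ms")
```

With the values from the example (4 GPUs, 160 MB, 16 GB/s), the three functions return 60, 80, and 15 ms respectively. Changing the GPU count or gradient size shows how the gap between the strategies scales under the same assumptions.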

Updated 2026-06-23

References

Dive into Deep Learning
