Communication Overhead Examples in Data-Parallel Training

The time required to synchronize parameters during data-parallel training varies significantly with the aggregation strategy, primarily because of hardware bandwidth constraints. For instance, on a 4-way GPU server with a 16 GB/s connection, a single transfer of a 160 MB gradient takes 10 ms (160 MB / 16 GB/s). Synchronizing with a single-GPU aggregation strategy therefore takes 60 ms: 30 ms for the three other GPUs to send their gradients to one GPU, and 30 ms to broadcast the updated weights back. If the same aggregation is performed on the CPU, the overhead increases to 80 ms because all four GPUs must send their gradients to the central processor and receive the weights from it (40 ms in each direction). However, if the gradients are partitioned into four 40 MB segments and each segment is aggregated on a different GPU simultaneously, the full-bandwidth operation of the PCIe switch across all links cuts each direction from 30 ms to 7.5 ms, for a total synchronization time of just 15 ms. This fourfold reduction relative to single-GPU aggregation shows why distributed gradient aggregation is preferred over single-GPU or CPU-based methods in practice.
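The arithmetic above can be reproduced with a small back-of-the-envelope model. This is only a sketch under simplifying assumptions that are not stated in the text: 1 GB is taken as 1000 MB, transfers that share one destination link are serialized, and latency and compute time are ignored. The function names are hypothetical and chosen for illustration.

```python
def transfer_ms(size_mb, bandwidth_gb_s):
    """Time in ms to move size_mb over one link (assumes 1 GB = 1000 MB)."""
    return size_mb / bandwidth_gb_s


def single_gpu_sync_ms(size_mb, bandwidth_gb_s, num_gpus):
    # The other (num_gpus - 1) GPUs send full gradients to one GPU,
    # which then sends the updated weights back to each of them.
    one_way = (num_gpus - 1) * transfer_ms(size_mb, bandwidth_gb_s)
    return 2 * one_way


def cpu_sync_ms(size_mb, bandwidth_gb_s, num_gpus):
    # Every GPU, including the fourth, sends to and receives from the CPU.
    one_way = num_gpus * transfer_ms(size_mb, bandwidth_gb_s)
    return 2 * one_way


def partitioned_sync_ms(size_mb, bandwidth_gb_s, num_gpus):
    # Each GPU aggregates one segment of size size_mb / num_gpus,
    # receiving it from the other (num_gpus - 1) GPUs in parallel
    # with the aggregation happening on the remaining GPUs.
    segment_mb = size_mb / num_gpus
    one_way = (num_gpus - 1) * transfer_ms(segment_mb, bandwidth_gb_s)
    return 2 * one_way


# Values from the example: 160 MB gradient, 16 GB/s links, 4 GPUs.
print(single_gpu_sync_ms(160, 16, 4))   # single-GPU aggregation
print(cpu_sync_ms(160, 16, 4))          # CPU aggregation
print(partitioned_sync_ms(160, 16, 4))  # partitioned aggregation
```

With the figures from the example, the three functions yield 60 ms, 80 ms, and 15 ms respectively. Changing `num_gpus` or `size_mb` shows how the gap between the strategies evolves under the same simplified model.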

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L