Concept

Overlapping Computation and Communication

In distributed deep learning, transferring data between devices (for example, moving gradients from a GPU to the CPU) uses the system bus (e.g., PCI-Express), a hardware resource distinct from the computational units. Because these resources are separate, a framework can overlap computation with communication, which reduces total execution time by hiding part of the transfer time behind the computation. For example, during backpropagation on a minibatch, the gradients of some parameters become available earlier than others. A system can start copying the gradients that are already computed over the bus while the GPU is still computing the remaining ones, since copying a finished result y[i-1] does not conflict with computing the next result y[i].
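The pipelining pattern can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how any particular framework implements it: the functions `compute` and `copy_to_host` are hypothetical stand-ins that sleep instead of touching a GPU or a bus, the durations `COMPUTE_S` and `COPY_S` are made-up values, and a single worker thread plays the role of the bus. While the main thread computes y[i], the worker is still copying y[i-1].

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE_S = 0.05  # assumed time to compute one result (hypothetical value)
COPY_S = 0.05     # assumed time to copy one result over the bus (hypothetical value)


def compute(i):
    """Hypothetical stand-in for computing y[i] on the GPU."""
    time.sleep(COMPUTE_S)
    return i * i


def copy_to_host(value):
    """Hypothetical stand-in for copying a finished result over the bus."""
    time.sleep(COPY_S)
    return value


def run_sequential(n):
    """Compute y[i], then copy it, and only then start on y[i+1]."""
    host = []
    for i in range(n):
        host.append(copy_to_host(compute(i)))
    return host


def run_overlapped(n):
    """Copy y[i-1] on a separate worker while y[i] is being computed."""
    host = []
    # One worker models a single bus that handles one transfer at a time.
    with ThreadPoolExecutor(max_workers=1) as bus:
        pending = None
        for i in range(n):
            y_i = compute(i)  # overlaps with the in-flight copy of y[i-1]
            if pending is not None:
                host.append(pending.result())  # wait for y[i-1] to arrive
            pending = bus.submit(copy_to_host, y_i)
        if pending is not None:
            host.append(pending.result())  # the last copy has nothing to hide behind
    return host


def timed(fn, n):
    """Return the result of fn(n) together with its wall-clock time in seconds."""
    start = time.perf_counter()
    out = fn(n)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    seq_out, seq_t = timed(run_sequential, 6)
    ovl_out, ovl_t = timed(run_overlapped, 6)
    print(f"sequential: {seq_t:.2f} s, overlapped: {ovl_t:.2f} s")
```

Both versions deliver the same results in the same order; only the schedule differs. In the overlapped version every copy except the last one runs concurrently with a computation, so in this simulation the total time approaches that of the slower of the two stages rather than their sum.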

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L