To optimally synchronize data across GPUs interconnected via NVLink, the network connectivity can be decomposed into distinct ring structures. For example, an $$8$$-GPU NVLink network can be organized into two separate rings: one ring utilizing double NVLink bandwidth and a second ring using regular bandwidth. This decomposition strategy avoids the bottleneck of the PCIe bus by allowing data to be synchronized directly between GPUs, maximizing the utilization of the high-speed aggregate bandwidth provided by the NVLink connections.
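The decomposition can be sanity-checked with a small sketch. The two GPU orderings below are illustrative assumptions, not taken from the text above, as are the helper names and the 2:1 bandwidth weights. The sketch checks that each ring visits every GPU once, that the rings use different GPU-to-GPU links, and then splits a payload in proportion to ring bandwidth.

```python
def ring_links(ring):
    """Return the undirected GPU-to-GPU links used by a ring (given as a visiting order)."""
    n = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}


def covers_all_gpus(ring, gpus):
    """A valid ring visits every GPU exactly once."""
    return len(ring) == len(set(ring)) and set(ring) == set(gpus)


def split_by_bandwidth(num_elements, weights):
    """Split a payload into per-ring sizes proportional to the ring bandwidths."""
    total = sum(weights)
    sizes = [num_elements * w // total for w in weights]
    sizes[-1] += num_elements - sum(sizes)  # give any rounding remainder to the last ring
    return sizes


GPUS = list(range(1, 9))
# Hypothetical orderings: the first ring is assumed to run over doubled links,
# the second over single links.
RING_DOUBLE = [1, 2, 3, 4, 5, 6, 7, 8]
RING_REGULAR = [1, 4, 6, 3, 5, 8, 2, 7]

# Assumed weights: the doubled ring carries twice the share of the regular ring.
shares = split_by_bandwidth(12_000_000, [2, 1])
print(shares)
```

With these assumed weights, two thirds of the gradient elements travel over the double-bandwidth ring and one third over the regular ring, so both rings would finish at roughly the same time.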

Decomposition of NVLink Networks into Rings

Modern deep learning hardware often features bespoke network connectivity to handle large data transfers efficiently. For example, in an $$8$$-GPU server, each GPU typically connects to a host CPU via a PCIe link operating at around $$16$$ GB/s. In addition, each GPU may have multiple NVLink connections to other GPUs, each capable of transferring data bidirectionally at much higher speeds (e.g., $$300$$ Gbit/s in total, which is $$150$$ Gbit/s or roughly $$18$$ GB/s per direction). Because the aggregate NVLink bandwidth across all links significantly exceeds the PCIe bandwidth, maximizing training efficiency requires specialized synchronization protocols that exploit this hardware architecture.
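The bandwidth comparison can be worked through numerically. This is a rough sketch that reads the 300 Gbit/s figure as the bidirectional total per link. The link count of 6 and the 160 MB payload are assumptions chosen for illustration, and the function names are hypothetical.

```python
PCIE_GB_S = 16.0             # PCIe bandwidth quoted in the text, in GB/s
NVLINK_BIDIR_GBIT_S = 300.0  # per-link NVLink figure, read as the bidirectional total


def gbit_to_gbyte(gbit_per_s):
    """Convert Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


# Half of the bidirectional figure is available in each direction.
per_direction_gb_s = gbit_to_gbyte(NVLINK_BIDIR_GBIT_S / 2)


def aggregate_nvlink_gb_s(num_links):
    """One-direction bandwidth summed over all of a GPU's NVLink connections."""
    return num_links * per_direction_gb_s


def transfer_time_ms(size_mb, bandwidth_gb_s):
    """Time to move size_mb megabytes: 1 MB at 1 GB/s takes 1 ms."""
    return size_mb / bandwidth_gb_s


num_links = 6  # assumption: the text only says "multiple" links
print(per_direction_gb_s)                # 18.75
print(aggregate_nvlink_gb_s(num_links))  # 112.5
print(transfer_time_ms(160, PCIE_GB_S))  # 10.0
print(transfer_time_ms(160, aggregate_nvlink_gb_s(num_links)))
```

Under these assumptions a single NVLink direction is only slightly faster than PCIe, and the advantage comes from using several links at once.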

Claude

Efficient multi-GPU training relies on two foundational data synchronization operations. First, parameters must be distributed to multiple devices and gradients must be attached, because without parameters it is impossible to evaluate the network on a GPU. Second, an allreduce function is required to sum values (in training, typically the gradients) across multiple devices and broadcast the result back, so that every device holds the same result.
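Both operations can be sketched with plain Python lists standing in for device memory. This is a minimal illustration under that assumption, not a real multi-GPU implementation: the names `distribute` and `allreduce` are hypothetical, and attaching gradients is omitted because it depends on the framework.

```python
import copy


def distribute(params, num_devices):
    """Give every (simulated) device its own independent copy of the parameters."""
    return [copy.deepcopy(params) for _ in range(num_devices)]


def allreduce(buffers):
    """Sum per-device vectors elementwise, then write the sum back to every device."""
    total = [sum(values) for values in zip(*buffers)]
    for buf in buffers:
        buf[:] = total  # in-place update, so each device sees the same result


# Three simulated devices, each holding its own local gradient vector.
device_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
allreduce(device_grads)
print(device_grads)  # [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]

replicas = distribute({"w": [0.5, -0.5]}, 3)
```

After `allreduce`, every device applies the same update to its own copy of the parameters, which keeps the replicas identical from step to step.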

Data Synchronization in Multi-GPU Training

Dive into Deep Learning

In data-parallel training, after gradients are computed across multiple devices, they must be synchronized to update the model parameters. This synchronization can be implemented using centralized strategies, where all gradients are sent to a single GPU or the CPU for aggregation, or through distributed strategies, where gradients are partitioned and aggregated simultaneously across multiple GPUs to leverage the full bandwidth of hardware switches.
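The difference between the two strategies can be seen by counting how many gradient elements the busiest device has to receive. The sketch below simulates devices with Python lists. The function names and the traffic accounting are simplifying assumptions, and the final step of sending the aggregated result back to all devices is left out.

```python
def centralized_sum(grads, root=0):
    """All devices send their full gradient to one root, which sums them."""
    total = list(grads[root])
    received = 0
    for dev, g in enumerate(grads):
        if dev == root:
            continue
        for i, v in enumerate(g):
            total[i] += v
        received += len(g)  # the root receives one full vector per peer
    return total, received


def distributed_sum(grads):
    """Each device owns one contiguous shard and sums only that shard."""
    n, d = len(grads), len(grads[0])
    bounds = [d * k // n for k in range(n + 1)]
    result = [0.0] * d
    max_received = 0
    for owner in range(n):
        lo, hi = bounds[owner], bounds[owner + 1]
        for i in range(lo, hi):
            result[i] = sum(g[i] for g in grads)
        # the owner receives its shard from each of the other n - 1 devices
        max_received = max(max_received, (n - 1) * (hi - lo))
    return result, max_received


# Four simulated devices with eight gradient elements each.
grads = [[float(dev + i) for i in range(8)] for dev in range(4)]
central, central_load = centralized_sum(grads)
spread, spread_load = distributed_sum(grads)
print(central == spread, central_load, spread_load)  # True 24 6
```

Both strategies produce the same sum, but in this example the busiest device receives 24 elements under the centralized scheme and only 6 under the partitioned one, which is why the distributed strategy can use more of the available bandwidth.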

Parameter Synchronization Strategies

In deep neural networks, gradients are computed sequentially from the output layers back to the input layers during backpropagation. To improve distributed training performance, systems can begin synchronizing the gradients of the layers closer to the output, which are already complete, while the layers closer to the input are still computing their gradients. This overlapping of communication and computation minimizes hardware idle time and shortens the overall training iteration.
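The benefit of overlapping can be estimated with a simple timeline model. This sketch assumes one compute stream and one communication stream that each process layers in backward order. The per-layer timings are made-up numbers and the function names are hypothetical.

```python
def sequential_time(compute, comm):
    """Finish all backward computation first, then synchronize every layer."""
    return sum(compute) + sum(comm)


def overlapped_time(compute, comm):
    """Start synchronizing a layer's gradient as soon as it has been computed.

    Both lists are given in backward order (output layer first). A layer's
    transfer waits for its own gradient and for the previous transfer.
    """
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c
        comm_done = max(comm_done, compute_done) + m
    return comm_done


# Hypothetical per-layer times in milliseconds, output layer first.
compute_ms = [2.0, 3.0, 4.0]
comm_ms = [3.0, 2.0, 1.0]

print(sequential_time(compute_ms, comm_ms))  # 15.0
print(overlapped_time(compute_ms, comm_ms))  # 10.0
```

In this model only the transfer for the last layer to finish its backward pass cannot be hidden behind computation, so the overlapped schedule takes 10 ms instead of 15 ms.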
