In data-parallel training, one straightforward parameter synchronization strategy is to aggregate all computed gradients on a single primary GPU. After each device computes the loss and gradients for its data batch, all gradients are transferred to this primary GPU. The primary GPU then aggregates the gradients, performs the parameter update, and broadcasts the newly updated parameters back to all other GPUs for the next training iteration.

Single-GPU Gradient Aggregation
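
As a rough illustration of the single-GPU scheme, the following Python sketch simulates devices as plain lists of floats. The function names, the learning rate, and the use of plain SGD as the update rule are assumptions made for the example; real frameworks move tensors between devices instead of copying lists.

```python
# Hypothetical simulation: each "GPU" holds its gradient as a list of floats.

def aggregate_on_primary(grads_per_device):
    """Sum the per-device gradients element-wise on the primary device (index 0)."""
    total = list(grads_per_device[0])
    for grad in grads_per_device[1:]:      # one transfer per non-primary device
        for i, g in enumerate(grad):
            total[i] += g
    return total

def sgd_step_and_broadcast(params, grads_per_device, lr):
    """Update on the primary device, then hand a copy of the result to every device."""
    total = aggregate_on_primary(grads_per_device)
    new_params = [p - lr * g for p, g in zip(params, total)]
    return [list(new_params) for _ in grads_per_device]   # one replica per device

params = [1.0, 2.0]
grads = [[0.5, 1.0], [1.5, -1.0], [2.0, 0.0]]   # gradients from three devices
replicas = sgd_step_and_broadcast(params, grads, lr=0.5)
```

Every device ends the step with an identical replica, but all traffic flows through the primary device, so its links limit how fast the step can complete.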

In multi-GPU training, gradients can also be aggregated on the host CPU instead of on a specific GPU. However, because CPUs typically lack enough PCIe lanes to connect directly to every GPU, the data must travel through a multiplexing switch. This creates a communication bottleneck: each GPU has to send its gradients to the CPU over the shared path, which incurs a significant bandwidth penalty and often makes synchronization much slower than direct GPU-to-GPU communication.

CPU-Based Gradient Aggregation

To maximize bandwidth efficiency in multi-GPU training, distributed gradient aggregation splits the model's gradients into smaller partitions and aggregates each partition simultaneously on a different GPU. Because modern PCIe switches allow for full-bandwidth operation across all connected links concurrently, this distributed approach significantly reduces the total time required for parameter synchronization compared to sending all gradients to a single centralized GPU or CPU.

Distributed Gradient Aggregation
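
The partitioning idea can be sketched as follows. This is a minimal simulation under stated assumptions: devices are plain Python lists, the number of partitions equals the number of devices, and the helper names `split_into_chunks` and `distributed_aggregate` are made up for the example.

```python
def split_into_chunks(vec, k):
    """Split vec into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(vec), k)
    chunks, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        chunks.append(vec[start:start + size])
        start += size
    return chunks

def distributed_aggregate(grads_per_device):
    """Device j sums partition j from all devices; then all partitions are shared."""
    k = len(grads_per_device)
    chunked = [split_into_chunks(g, k) for g in grads_per_device]
    reduced = []
    for j in range(k):                     # conceptually, these k sums run in parallel
        parts = [chunked[d][j] for d in range(k)]
        reduced.append([sum(vals) for vals in zip(*parts)])
    full = [x for chunk in reduced for x in chunk]
    return [list(full) for _ in range(k)]  # every device receives the full result

result = distributed_aggregate([[1.0, 2.0, 3.0, 4.0, 5.0],
                                [10.0, 20.0, 30.0, 40.0, 50.0]])
```

Each device receives only its own partition from its peers, so no single link has to carry every full gradient.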

The ring synchronization algorithm aggregates gradients across a ring of $$n$$ computing nodes without the time cost growing linearly in $$n$$. Sending the full gradient sequentially from node to node would leave most nodes idle at any moment. Instead, the gradient is divided into $$n$$ distinct chunks, and the algorithm begins synchronizing chunk $$i$$ at node $$i$$, with all chunks in flight simultaneously. Because each node transmits only a $$1/n$$ fraction of the gradient in each step, all nodes communicate in parallel. After $$n-1$$ steps, the total time spent aggregating the gradients is $$(n-1)/n \approx 1$$ times the time needed to send the full gradient across a single link, so the synchronization time stays approximately constant regardless of the ring size.

Ring Synchronization Algorithm
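
A small simulation can make the chunk schedule concrete. The sketch below uses the common two-phase formulation (a reduce-scatter pass followed by an all-gather pass), which is an assumption about how the finished chunks are redistributed; each phase takes $$n-1$$ steps and every node sends exactly one chunk per step. The function name `ring_allreduce` and the list-based data layout are hypothetical.

```python
def ring_allreduce(data):
    """data[i][c] is chunk c of node i's gradient; returns the per-node results."""
    n = len(data)
    chunks = [[list(c) for c in node] for node in data]   # work on a copy

    # Phase 1 (reduce-scatter): chunk i starts at node i and accumulates
    # as it travels around the ring.
    for step in range(n - 1):
        msgs = [((i + 1) % n, (i - step) % n) for i in range(n)]
        payloads = [list(chunks[i][c]) for i, (_, c) in enumerate(msgs)]
        for (dst, c), payload in zip(msgs, payloads):     # all sends happen at once
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2 (all-gather): node i now owns the finished chunk (i + 1) % n
    # and forwards finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((i + 1) % n, (i + 1 - step) % n) for i in range(n)]
        payloads = [list(chunks[i][c]) for i, (_, c) in enumerate(msgs)]
        for (dst, c), payload in zip(msgs, payloads):
            chunks[dst][c] = payload
    return chunks

nodes = [[[1.0], [2.0], [3.0]],
         [[10.0], [20.0], [30.0]],
         [[100.0], [200.0], [300.0]]]
summed = ring_allreduce(nodes)
```

Collecting all payloads before applying them mimics the fact that every node sends and receives within the same step.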

Implementing the synchronization steps required for distributed multi-GPU training is nontrivial in practice. To manage this, frameworks use a common abstraction: a key-value store with redefined update semantics. By hiding the complexity of distributed synchronization behind simple push and pull operations, this abstraction decouples the concerns of statistical modelers, who want to express optimization in simple terms, from those of system engineers, who deal with the distributed hardware.

Key-Value Store Abstraction for Distributed Training
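
To give a feel for the abstraction, here is a toy single-process sketch. The class `KVStore` and its method signatures are hypothetical and do not reproduce any particular framework's API; summation is assumed as the aggregation rule applied on push.

```python
class KVStore:
    """Toy key-value store whose push aggregates values instead of overwriting."""

    def __init__(self):
        self._store = {}

    def init(self, key, value):
        """Register a key with an initial value."""
        self._store[key] = list(value)

    def push(self, key, values_per_device):
        """Store the element-wise sum of the values pushed by all devices."""
        self._store[key] = [sum(vals) for vals in zip(*values_per_device)]

    def pull(self, key, num_devices):
        """Return one independent copy of the stored value per device."""
        return [list(self._store[key]) for _ in range(num_devices)]

kv = KVStore()
kv.init("layer0.weight", [0.0, 0.0])
kv.push("layer0.weight", [[1.0, 2.0], [3.0, 4.0]])   # gradients from two devices
aggregated = kv.pull("layer0.weight", num_devices=2)
```

The caller only names a key and calls push and pull; how the sum is actually computed (single GPU, CPU, partitioned, or ring) stays hidden behind the interface.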

The time required to synchronize parameters during distributed training can vary significantly based on the underlying hardware. Consequently, parameter synchronization strategies must be highly adaptive to both the broader network infrastructure and the specific internal connectivity within a given server to minimize communication delays and maximize overall training efficiency.

Adaptivity of Parameter Synchronization to Network Infrastructure

In data-parallel training, after gradients are computed across multiple devices, they must be synchronized to update the model parameters. This synchronization can be implemented using centralized strategies, where all gradients are sent to a single GPU or the CPU for aggregation, or through distributed strategies, where gradients are partitioned and aggregated simultaneously across multiple GPUs to leverage the full bandwidth of hardware switches.

Claude

Efficient multi-GPU training relies on two foundational data synchronization operations. First, parameters must be distributed to every device and have gradients attached, because a network cannot be evaluated on a GPU that does not hold its parameters. Second, an allreduce function is required to sum the per-device values (in practice, the gradients of each parameter) across devices and broadcast the result back, so that all replicas stay consistent.

Data Synchronization in Multi-GPU Training
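
The two operations can be sketched in a few lines. This is a simulation with plain Python lists standing in for device buffers; the names `replicate` and `allreduce` and the dictionary layout of the parameters are assumptions for illustration, and attaching gradients is framework-specific and therefore omitted.

```python
def replicate(params, num_devices):
    """Give every device its own copy of all named parameters."""
    return [{name: list(value) for name, value in params.items()}
            for _ in range(num_devices)]

def allreduce(buffers):
    """In place: every buffer ends up holding the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    for buf in buffers:
        buf[:] = total

device_params = replicate({"w": [1.0, 2.0], "b": [0.5]}, num_devices=2)

device_grads = [[1.0, 2.0], [3.0, 4.0]]   # one gradient buffer per device
allreduce(device_grads)
```

After the call, each device holds the same summed buffer, so applying the same update rule on every device keeps the replicas identical.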

Dive into Deep Learning

Parameter Synchronization Strategies

In deep neural networks, backpropagation computes gradients layer by layer, from the output layer back to the input layer. To improve distributed training performance, a system can begin synchronizing the gradients of layers near the output, which are finished first, while the layers near the input are still computing theirs. This overlap of communication and computation reduces hardware idle time and shortens the overall training iteration.

Overlapping Gradient Computation and Synchronization
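
The benefit can be estimated with a simple timing model. The sketch below assumes one compute stream and one communication link that handles a single transfer at a time; the per-layer times are made-up values in arbitrary units, and `iteration_time` is a hypothetical helper.

```python
def iteration_time(compute, comm, overlap):
    """compute[k], comm[k]: gradient and sync time of the k-th layer in backward order."""
    if not overlap:
        return sum(compute) + sum(comm)      # synchronize only after backprop ends
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c                    # this layer's gradient is now ready
        # The transfer starts once the gradient exists and the link is free.
        comm_done = max(comm_done, compute_done) + m
    return comm_done

compute = [2.0, 2.0, 2.0]   # hypothetical backward times, output layer first
comm = [1.0, 1.0, 1.0]      # hypothetical synchronization times per layer
sequential = iteration_time(compute, comm, overlap=False)
overlapped = iteration_time(compute, comm, overlap=True)
```

In this model only the transfer of the last layer to finish is exposed; the transfers of the earlier layers are hidden behind the remaining computation.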

Modern deep learning hardware often features bespoke network connectivity to handle large data transfers efficiently. For example, in an $$8$$-GPU server, each GPU typically connects to the host CPU via a PCIe link operating at around $$16$$ GB/s. In addition, each GPU may have multiple NVLink connections to other GPUs, each capable of transferring $$300$$ Gbit/s bidirectionally, or roughly $$18$$ GB/s per direction. A single NVLink connection is therefore only comparable to the PCIe link, but because the aggregate bandwidth of several NVLink connections significantly exceeds the PCIe bandwidth, maximizing training efficiency requires specialized synchronization protocols that exploit this hardware architecture.
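
The unit conversion behind these figures is easy to check. In the sketch below the PCIe and NVLink numbers come from the text, while the link count passed to `aggregate_bandwidth` is a hypothetical value chosen only to show how the aggregate scales.

```python
BITS_PER_BYTE = 8

def gbit_to_gbyte(gbit_per_s):
    """Convert Gbit/s to GB/s."""
    return gbit_per_s / BITS_PER_BYTE

PCIE_GB_PER_S = 16.0                                   # per GPU, from the text
nvlink_bidirectional = gbit_to_gbyte(300)              # GB/s, both directions combined
nvlink_per_direction = nvlink_bidirectional / 2        # GB/s in one direction

def aggregate_bandwidth(num_links):
    """One-directional NVLink bandwidth when num_links links are used together."""
    return num_links * nvlink_per_direction

example_links = 4                                      # assumption, not from the text
ratio = aggregate_bandwidth(example_links) / PCIE_GB_PER_S
```

With these inputs a single link provides 18.75 GB/s per direction, which matches the rounded figure of about $$18$$ GB/s.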
