Using a key-value store abstraction for distributed training allows for the management of many sets of gradients by indexing them with a key $$i$$. This abstraction defines two main operations:

- push(key, value): Sends a particular gradient (the value) from a worker to a common storage where it is aggregated (e.g., by summation).
- pull(key, value): Retrieves the final aggregate gradient value from the common storage after combining the inputs from all workers.

This abstraction shares many characteristics with distributed key-value stores such as Dynamo, particularly in how data (here, parameters) is distributed across multiple servers.
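The push and pull semantics described above can be sketched with a small in-memory class. This is only an illustration under stated assumptions: `ToyKVStore` is a hypothetical name, everything runs in one process, and `pull` returns the aggregate instead of filling a `value` argument as the `pull(key, value)` signature suggests.

```python
class ToyKVStore:
    """Hypothetical single-process stand-in for a shared gradient store."""

    def __init__(self):
        # key -> running elementwise sum of every gradient pushed so far
        self._sums = {}

    def push(self, key, value):
        """Aggregate one worker's gradient for `key` by summation."""
        if key not in self._sums:
            self._sums[key] = list(value)
        else:
            self._sums[key] = [a + b for a, b in zip(self._sums[key], value)]

    def pull(self, key):
        """Return a copy of the aggregated gradient stored under `key`."""
        return list(self._sums[key])


store = ToyKVStore()

# Two workers push their gradients for layer 0 and layer 1.
store.push(0, [1.0, 2.0])
store.push(0, [3.0, 4.0])
store.push(1, [0.5])
store.push(1, [0.25])

layer0 = store.pull(0)  # elementwise sum over both workers: [4.0, 6.0]
layer1 = store.pull(1)  # [0.75]
```

A real store would additionally decide when an aggregate is complete (for example, after all workers have pushed) and would place keys on different servers; neither is modeled here.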

Push and Pull Operations in Distributed Training

Amazon's Dynamo is a real-world distributed key-value store whose architecture parallels the parameter synchronization mechanisms used in deep learning. Because a neural network has many layers, each layer's set of gradients is indexed by a key $$i$$. This key-based indexing resembles the way systems like Dynamo spread keyed data across multiple servers, which is what makes the same structure suitable for distributing parameters.

Dynamo Key-Value Store Example

Implementing the synchronization steps required for distributed multi-GPU training is nontrivial in practice. To manage this, frameworks use a common abstraction: a key-value store with redefined update semantics. By hiding the complexity of distributed synchronization behind simple push and pull operations, this abstraction separates the concerns of statistical modelers, who want to express optimization in simple terms, from those of system engineers, who deal with the distributed hardware.


In data-parallel training, after gradients are computed across multiple devices, they must be synchronized to update the model parameters. This synchronization can be implemented using centralized strategies, where all gradients are sent to a single GPU or the CPU for aggregation, or through distributed strategies, where gradients are partitioned and aggregated simultaneously across multiple GPUs to leverage the full bandwidth of hardware switches.

Parameter Synchronization Strategies

Dive into Deep Learning

In data-parallel training, one straightforward parameter synchronization strategy is to aggregate all computed gradients on a single primary GPU. After each device computes the loss and gradients for its data batch, all gradients are transferred to this primary GPU. The primary GPU then aggregates the gradients, performs the parameter update, and broadcasts the newly updated parameters back to all other GPUs for the next training iteration.
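The three steps above (gather, update, broadcast) can be simulated in plain Python. This is a minimal sketch, not a framework implementation: devices are modeled as list entries, the function name `aggregate_on_primary` is hypothetical, and plain SGD with learning rate `lr` is assumed as the update rule.

```python
def aggregate_on_primary(device_grads, params, lr, primary=0):
    """Simulate one synchronization step with a single aggregating device.

    device_grads: one gradient list per device.
    params: the parameter list currently held by every device.
    Returns the per-device parameter copies after the broadcast.
    """
    n_devices = len(device_grads)

    # Step 1: every other device sends its gradient to the primary device,
    # which sums them elementwise.
    total = list(device_grads[primary])
    for d in range(n_devices):
        if d == primary:
            continue
        total = [t + g for t, g in zip(total, device_grads[d])]

    # Step 2: the primary device applies the update (plain SGD assumed).
    new_params = [p - lr * t for p, t in zip(params, total)]

    # Step 3: the updated parameters are broadcast back to all devices.
    return [list(new_params) for _ in range(n_devices)]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
copies = aggregate_on_primary(grads, params=[10.0, 20.0], lr=0.5)
# Summed gradient is [9.0, 12.0], so every device ends with [5.5, 14.0].
```

Note that all traffic in steps 1 and 3 passes through the primary device, which is why this strategy scales poorly as devices are added.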

Single-GPU Gradient Aggregation

In multi-GPU training, it is possible to use the central CPU to aggregate gradients instead of a specific GPU. However, because CPUs typically lack sufficient direct PCIe lanes to connect to all GPUs, data must travel through a multiplexer switch. This architecture creates a communication bottleneck, as each GPU must send its gradients to the CPU individually, incurring a significant bandwidth penalty and resulting in synchronization times that are often much slower than direct GPU-to-GPU communication.

CPU-Based Gradient Aggregation

To maximize bandwidth efficiency in multi-GPU training, distributed gradient aggregation splits the model's gradients into smaller partitions and aggregates each partition simultaneously on a different GPU. Because modern PCIe switches allow for full-bandwidth operation across all connected links concurrently, this distributed approach significantly reduces the total time required for parameter synchronization compared to sending all gradients to a single centralized GPU or CPU.
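The partitioning idea can be sketched as follows. The helper names `partition` and `distributed_aggregate` are hypothetical, devices are simulated as list entries, and the loops run sequentially here even though on real hardware each owner would receive its partition concurrently.

```python
def partition(vec, n_parts):
    """Split vec into n_parts contiguous slices of near-equal length."""
    base, extra = divmod(len(vec), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        parts.append(vec[start:end])
        start = end
    return parts


def distributed_aggregate(device_grads):
    """Device j owns partition j and sums that partition from every device."""
    n = len(device_grads)
    split = [partition(g, n) for g in device_grads]

    owned = []
    for j in range(n):  # on real hardware these n reductions overlap in time
        acc = [0.0] * len(split[0][j])
        for d in range(n):
            acc = [a + x for a, x in zip(acc, split[d][j])]
        owned.append(acc)

    # Concatenating the owned partitions recovers the full aggregated gradient.
    full = [x for part in owned for x in part]
    return owned, full


grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
owned, full = distributed_aggregate(grads)
# owned == [[11.0, 22.0], [33.0, 44.0]]; full == [11.0, 22.0, 33.0, 44.0]
```

Each device receives only its own partition from the others, so no single link has to carry the whole gradient from every peer.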

Distributed Gradient Aggregation

The ring synchronization algorithm aggregates gradients across a ring of $$n$$ computing nodes without the time cost growing linearly in $$n$$. Sending the full gradient sequentially from node to node would leave most nodes idle, so the gradient is instead divided into $$n$$ chunks, and synchronization of chunk $$i$$ starts at node $$i$$. Because each node transmits only a $$1/n$$ fraction of the gradient per step, all nodes communicate in parallel. After $$n-1$$ steps, the time spent aggregating is proportional to $$(n-1)/n \approx 1$$ times the cost of sending the full gradient once, so synchronization time stays approximately constant as the ring grows.
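The chunked ring aggregation can be simulated in a few lines. This sketch makes several assumptions for brevity: each chunk is a single number, the function name `ring_reduce_scatter` is hypothetical, and only the aggregation phase is shown (circulating the finished chunks so that every node holds all of them would need a further pass).

```python
def ring_reduce_scatter(node_chunks):
    """Aggregate chunks around a ring.

    node_chunks[r][c] is node r's local value for chunk c.
    Returns (reduced, steps): reduced maps chunk index -> fully summed value,
    steps is the number of chunk transmissions each node performed.
    """
    n = len(node_chunks)
    data = [list(chunks) for chunks in node_chunks]
    steps = 0

    for step in range(n - 1):
        # All nodes send at the same time, so collect the messages first.
        # In this step node r forwards chunk (r - step) mod n to node r + 1;
        # at step 0 that means chunk i starts out at node i.
        messages = []
        for r in range(n):
            c = (r - step) % n
            messages.append(((r + 1) % n, c, data[r][c]))
        # Each receiver adds the incoming partial sum to its own chunk.
        for dest, c, value in messages:
            data[dest][c] += value
        steps += 1

    # After n - 1 steps, node r holds the complete sum for chunk (r + 1) mod n.
    reduced = {(r + 1) % n: data[r][(r + 1) % n] for r in range(n)}
    return reduced, steps


local = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
reduced, steps = ring_reduce_scatter(local)
fraction_sent = steps / len(local)  # (n - 1) / n of the gradient per node
```

With four nodes each node sends three chunks, that is 3/4 of its gradient, which matches the $$(n-1)/n$$ factor above.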

Ring Synchronization Algorithm

Key-Value Store Abstraction for Distributed Training

The time required to synchronize parameters during distributed training can vary significantly based on the underlying hardware. Consequently, parameter synchronization strategies must be highly adaptive to both the broader network infrastructure and the specific internal connectivity within a given server to minimize communication delays and maximize overall training efficiency.
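How much the choice of strategy matters can be estimated with a back-of-the-envelope cost model. Everything here is an assumption for illustration: the function name `sync_time_ms`, the gradient size, the link bandwidth, and the simplification that transfers into a single aggregator are serialized while partitioned transfers to different owners overlap fully.

```python
def sync_time_ms(strategy, n_gpus, grad_mb, link_gb_per_s):
    """Rough round-trip synchronization time in milliseconds.

    Assumes 1 GB = 1000 MB, so MB / (GB/s) yields milliseconds.
    """
    full_transfer_ms = grad_mb / link_gb_per_s

    if strategy == "single_gpu":
        # n - 1 peers send the full gradient to one GPU, one after another.
        one_way = (n_gpus - 1) * full_transfer_ms
    elif strategy == "cpu":
        # All n GPUs send the full gradient to the CPU, one after another.
        one_way = n_gpus * full_transfer_ms
    elif strategy == "distributed":
        # Each GPU owns 1/n of the gradient; the owners receive in parallel.
        one_way = (n_gpus - 1) * full_transfer_ms / n_gpus
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Sending the result back is assumed to cost the same as gathering it.
    return 2 * one_way


# Hypothetical setup: 4 GPUs, 160 MB of gradients, 16 GB/s per link.
times = {
    s: sync_time_ms(s, n_gpus=4, grad_mb=160, link_gb_per_s=16)
    for s in ("single_gpu", "cpu", "distributed")
}
# Under these assumptions: single_gpu 60 ms, cpu 80 ms, distributed 15 ms.
```

Changing the bandwidth or the topology assumptions changes the ranking's margins considerably, which is why the strategy has to be matched to the actual hardware.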
