Multi-Machine Distributed Parallel Training Process
Distributed parallel training across multiple machines requires a synchronized sequence of operations, because parameters must be exchanged over a network fabric with lower bandwidth than the links between GPUs inside a single machine. Each iteration follows seven steps. First, each machine reads its own batch of data, splits it, and transfers the pieces to its local GPUs, where predictions and gradients are computed separately on each GPU. Second, the local gradients are aggregated on one GPU within each machine (or partially aggregated across several GPUs). Third, the aggregated gradients are sent from the GPUs to the machine's CPUs. Fourth, the CPUs transmit the gradients over the network to a central parameter server, which aggregates the gradients from all machines. Fifth, the central server uses the aggregated gradients to update the parameters and broadcasts the updated parameters back to the individual CPUs. Sixth, each machine's CPUs send the updated parameters to one or more local GPUs. Finally, the updated parameters are copied to all GPUs on the machine, ready for the next iteration.
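The seven steps can be traced in a toy, single-process simulation. This is only a sketch under strong assumptions: the model is a scalar linear fit with squared loss, "machines" and "GPUs" are plain Python lists, and no real devices, network, or parameter-server library is involved. The names `grad`, `split`, and `train_step` are hypothetical helpers introduced for illustration.

```python
# Toy simulation of one synchronous iteration (assumption: scalar model y = w * x).
# Machines, GPUs, CPUs and the parameter server are ordinary Python values here.

def grad(w, shard):
    """Summed gradient of 0.5 * (w * x - y) ** 2 over a shard of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in shard)

def split(batch, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    k, r = divmod(len(batch), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + k + (1 if i < r else 0)
        chunks.append(batch[start:end])
        start = end
    return chunks

def train_step(server_w, gpu_params, machine_batches, lr):
    """Run one iteration; gpu_params[m][g] is the copy of w on GPU g of machine m."""
    num_examples = sum(len(b) for b in machine_batches)
    cpu_grads = []
    for params, batch in zip(gpu_params, machine_batches):
        # Step 1: split the machine's batch across its GPUs; each GPU
        # computes a gradient on its own shard with its own parameter copy.
        shards = split(batch, len(params))
        local_grads = [grad(w, s) for w, s in zip(params, shards)]
        # Step 2: aggregate the local gradients within the machine.
        machine_grad = sum(local_grads)
        # Step 3: hand the aggregated gradient to the machine's CPU.
        cpu_grads.append(machine_grad)
    # Step 4: every CPU sends its gradient to the central server, which sums them.
    total_grad = sum(cpu_grads)
    # Step 5: the server updates the parameter (plain SGD on the mean gradient)
    # and broadcasts the new value back to the CPUs.
    new_w = server_w - lr * total_grad / num_examples
    # Steps 6 and 7: each CPU passes the new value to a local GPU, which
    # then copies it to every GPU on that machine.
    new_gpu_params = [[new_w for _ in params] for params in gpu_params]
    return new_w, new_gpu_params

# Usage: 2 machines with 2 GPUs each, data generated from y = 2 * x.
data = [(float(x), 2.0 * x) for x in range(1, 7)]
machine_batches = split(data, 2)
w, gpu_params = train_step(0.0, [[0.0, 0.0], [0.0, 0.0]], machine_batches, lr=0.01)
print(w, gpu_params)
```

Because the gradients are summed at every level before the update, the result matches a single-device SGD step on the combined batch (up to floating-point rounding). The sketch ignores the actual cost of each transfer, which is what makes the multi-machine case slow in practice.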
Tags
D2L
Dive into Deep Learning @ D2L