To overcome the bandwidth bottleneck of a single central parameter server in multi-machine training, the parameter synchronization workload can be distributed across multiple servers. With $$n$$ parameter servers, each server stores and updates only an $$\mathcal{O}(1/n)$$ fraction of the parameters. Consequently, the total time required for parameter updates and optimization across $$m$$ workers falls from $$\mathcal{O}(m)$$ to $$\mathcal{O}(m/n)$$. Matching the two numbers ($$n = m$$) makes this ratio constant, so synchronization time no longer grows with the number of workers. In practice, this is achieved by using the same machines as both workers and servers.
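The scaling argument above can be sketched with a deliberately simple cost model. It assumes every server has the same bandwidth, each worker sends a server only that server's slice of the gradient, and latency is ignored. The names `shard_ranges` and `sharded_sync_time` are hypothetical and exist only for this illustration.

```python
def shard_ranges(num_params, num_servers):
    """Return one (start, stop) index range per server, covering all parameters.

    Illustrative helper: contiguous, near-equal shards, so each server
    holds roughly 1/n of the parameter vector.
    """
    base, extra = divmod(num_params, num_servers)
    ranges, start = [], 0
    for server in range(num_servers):
        size = base + (1 if server < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def sharded_sync_time(num_workers, gradient_bytes, server_bandwidth, num_servers):
    """Estimated seconds for every server to receive its shard from all workers.

    Assumed model: each of the m workers sends gradient_bytes / n to each
    server, and a server's incoming link (bytes per second) is shared by
    all workers, so the estimate is m * (gradient_bytes / n) / bandwidth.
    """
    shard_bytes = gradient_bytes / num_servers
    return num_workers * shard_bytes / server_bandwidth


# Using the same machines as workers and servers means n == m.
for m in (4, 8, 16):
    estimate = sharded_sync_time(m, 1e9, 1e9, num_servers=m)
    print(f"m = n = {m}: about {estimate:.2f} s per synchronization")
```

Under this model, setting `num_servers` equal to `num_workers` keeps the estimate at the same value as machines are added, which is the constant-scaling case described above.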

Scaling with Multiple Parameter Servers

In multi-machine distributed training, a single central parameter server creates a significant bandwidth bottleneck. Every machine must communicate with this one server to send gradients and receive updated parameters, yet the server's network bandwidth is finite and comparatively low. With $$m$$ worker machines, the time required to send all gradients to the central server grows linearly, giving an update time of $$\mathcal{O}(m)$$. This severely limits the scalability of synchronous distributed optimization, because the central server cannot keep up with the data transfer demands of many simultaneous workers.
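The linear growth can be made concrete with a back-of-the-envelope estimate. This is a sketch under stated assumptions: the server's incoming link is the only constraint, every worker sends a gradient of the same size, and latency is ignored. The function name and the example sizes are hypothetical.

```python
def single_server_sync_time(num_workers, gradient_bytes, server_bandwidth):
    """Estimated seconds for one server to receive a full gradient from each worker.

    Assumed model: all m transfers share the server's single incoming link
    (bytes per second), so the total is m * gradient_bytes / bandwidth.
    """
    return num_workers * gradient_bytes / server_bandwidth


# Illustrative numbers: a 1 GB gradient and a 1 GB/s server link.
for m in (1, 2, 4, 8):
    estimate = single_server_sync_time(m, 1e9, 1e9)
    print(f"{m} workers: about {estimate:.1f} s to collect all gradients")
```

Doubling the number of workers doubles the estimate, which is the $$\mathcal{O}(m)$$ behavior described above.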

Claude

Distributed parallel training across multiple machines involves a synchronized sequence of operations to manage parameters over a comparatively low-bandwidth network fabric. The process follows seven steps. First, a batch of data is read on each machine, split, and transferred to the local GPUs, where predictions and gradients are computed separately. Second, the local gradients are aggregated (or partially aggregated) on the GPUs within each machine. Third, the aggregated gradients are sent from the GPUs to the machine's CPUs. Fourth, the CPUs transmit the gradients over the network to a central parameter server, which aggregates the gradients from all machines. Fifth, the central server uses the aggregated gradients to update the parameters and broadcasts them back to the individual CPUs. Sixth, the CPUs send the updated parameters to one or more local GPUs. Finally, the parameters are spread across all GPUs on the machine for the next iteration.
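The seven steps can be traced in a small single-process simulation. This is only a sketch: machines and GPUs are represented by nested Python lists, device and network transfers are plain assignments, and the update rule is assumed to be plain gradient descent on a one-parameter squared-error model. The names `squared_error_grad` and `synchronous_step` are hypothetical.

```python
def squared_error_grad(params, batch):
    """Gradient of 0.5 * (w * x - y)**2 summed over a batch of (x, y) pairs."""
    w = params[0]
    return [sum((w * x - y) * x for x, y in batch)]


def synchronous_step(params, machine_batches, grad_fn, lr):
    """Simulate one synchronous update.

    machine_batches[i][j] is the data shard for GPU j on machine i.
    Returns the updated parameters and one replica per GPU.
    """
    machine_grads = []
    for gpu_batches in machine_batches:
        # Step 1: each GPU computes a gradient on its own shard.
        gpu_grads = [grad_fn(params, batch) for batch in gpu_batches]
        # Step 2: aggregate the gradients within the machine.
        local_grad = [sum(parts) for parts in zip(*gpu_grads)]
        # Step 3: GPU-to-CPU copy (a plain assignment in this simulation).
        machine_grads.append(local_grad)

    # Step 4: the central server aggregates gradients from all machines.
    total_grad = [sum(parts) for parts in zip(*machine_grads)]
    # Step 5: the server updates the parameters (assumed rule: plain SGD).
    new_params = [p - lr * g for p, g in zip(params, total_grad)]
    # Steps 6-7: broadcast to each machine, then to every GPU on it.
    replicas = [[list(new_params) for _ in gpu_batches]
                for gpu_batches in machine_batches]
    return new_params, replicas


# Two machines with two GPUs each; the data follows y = 2 * x.
batches = [
    [[(1.0, 2.0)], [(2.0, 4.0)]],
    [[(3.0, 6.0)], [(4.0, 8.0)]],
]
new_params, replicas = synchronous_step([0.0], batches, squared_error_grad, lr=0.01)
```

Because the per-GPU gradients are simply summed at each level, the two-stage aggregation yields the same update as summing all four gradients in one place. The hierarchy changes where the communication happens, not the result.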
