Network partitioning, also known as layer-wise model parallelism, is a multiple-GPU training strategy where the neural network is divided sequentially across devices. Each GPU takes the input for a specific set of layers, processes it, and transfers the intermediate activations to the next GPU. While this controls the memory footprint per GPU and allows for the training of larger networks, it introduces significant bottlenecks. The interfaces between layers require tight synchronization and massive data transfers of activations and gradients, which can easily overwhelm GPU bus bandwidth. Furthermore, ensuring that sequential computational workloads are evenly matched between layers is highly difficult, making linear scaling challenging to achieve.

Network Partitioning

Layerwise partitioning, commonly known as tensor parallelism, is a multiple-GPU training strategy that splits the computational work within individual network layers across different devices. For example, instead of computing all channels of a convolutional layer on a single GPU, the workload can be distributed so that multiple GPUs each compute a fraction of the channels. This approach scales effectively in terms of computation and enables the processing of larger networks by pooling the memory of multiple GPUs. However, because each layer's computation depends on the aggregated results from all other participating GPUs, this strategy requires a massive number of synchronization barriers and incurs exorbitant bandwidth costs for data transfer.

Layerwise Partitioning

Data parallelism is a widely used and highly convenient strategy for distributing deep learning training across multiple GPUs. In this approach, every GPU maintains a complete replica of the model and performs the identical sequence of operations, but each processes a different subset of the training minibatch. After each minibatch, the independently computed gradients are aggregated across all GPUs to synchronize and update the model parameters. To maximize efficiency, it is highly desirable to overlap computation and communication by exchanging gradients for some parameters while others are still being computed. While data parallelism enables larger effective minibatch sizes and increases overall training throughput, it is ultimately constrained by the memory of a single GPU and does not facilitate the training of larger models.

Data Parallelism

When training deep neural networks on multiple GPUs, the computational workload and memory requirements must be distributed to achieve efficiency and overcome hardware limits. The three primary parallelization strategies are network partitioning (distributing subsequent layers across different GPUs), layerwise partitioning (splitting the operations within a single layer across multiple GPUs), and data parallelism (partitioning the training data across GPUs while maintaining a full model replica on each). By and large, data parallelism is the most convenient and widely used approach, provided the GPUs have sufficiently large memory to hold the model.

Claude

Graphics Processing Units (GPUs) fundamentally transformed deep learning by providing the immense computational throughput required to train deep neural networks. Originally developed to accelerate computer graphics by rapidly performing $$4 	imes 4$$ matrix-vector products, this highly parallel architecture proved perfectly suited for the dense linear algebra operations and convolutions inherent in neural networks. The development of early GPU-accelerated convolution libraries, such as cuda-convnet developed by Alex Krizhevsky and Ilya Sutskever for two NVIDIA GTX 580s, enabled the training of massive models like AlexNet. This hardware breakthrough made deep, data-hungry architectures computationally feasible, igniting the modern deep learning boom.

Graphics Processing Unit (GPU) in Deep Learning

Dive into Deep Learning

The architectural differences between Central Processing Units (CPUs) and Graphics Processing Units (GPUs) explain why GPUs dominate deep learning computation. A CPU features a small number of highly powerful, complex cores designed for general-purpose computing, dedicating significant silicon area to sophisticated control flow like branch prediction. This makes them ideal for executing sequential code but less efficient for massive parallelism. In contrast, a GPU comprises thousands of simpler, weaker cores running at lower clock frequencies. Because power consumption grows quadratically with clock speed, using many slower cores is vastly more energy-efficient than a few fast ones. This high-throughput parallel design, combined with exceptionally wide memory buses, allows GPUs to process the massive matrix multiplications required in deep neural networks orders of magnitude faster than CPUs.

CPU vs. GPU Architecture in Deep Learning

Introduced in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, AlexNet was a pioneering $$8$$-layer Convolutional Neural Network (CNN) that won the 2012 ImageNet Large Scale Visual Recognition Challenge by a large margin. It demonstrated for the first time that features obtained through automatic learning could transcend manually-designed features, effectively breaking the previous paradigm in computer vision. Structurally, it is an evolutionary improvement over the earlier LeNet-5 architecture, sharing many architectural elements but scaled up significantly to leverage massive training datasets and faster Graphics Processing Units (GPUs). The network processes input images through a deep hierarchy of convolutional layers, max-pooling layers, and fully connected layers to generate predictions.

AlexNet Convolutional Neural Network

The cuda-convnet library is a highly optimized implementation of deep convolutional neural networks designed to run on GPUs, developed by Alex Krizhevsky and Ilya Sutskever. By efficiently parallelizing convolutions and matrix multiplications on hardware, it served as the industry standard for several years and powered the initial boom in deep learning.

cuda-convnet

General-Purpose GPUs (GPGPUs) are graphics processing units that have been optimized by chip manufacturers to handle general computing operations beyond traditional graphics rendering. By supporting high-throughput matrix-vector products, these hardware architectures provided the crucial computational power required to train deep neural networks effectively.

General-Purpose GPUs (GPGPUs)

In deep learning environments, the hardware configuration of Graphics Processing Units (GPUs) varies depending on computational requirements and physical constraints. These accelerators are typically connected to the central system via a high-speed expansion bus, such as Peripheral Component Interconnect Express (PCIe). High-end servers designed for intensive model training often deploy up to $$8$$ GPUs connected in an advanced topology to maximize parallel processing capabilities. Conversely, standard desktop systems generally accommodate $$1$$ or $$2$$ GPUs, with the exact setup limited by the user's budget and the capacity of the system's power supply.

GPU Hardware Configurations in Deep Learning

Because GPUs possess significantly more processing elements than CPUs, they require vastly higher memory bandwidth to avoid starving their compute units. To satisfy this demand, GPU architectures employ two primary strategies. First, they utilize much wider memory buses, such as the 352-bit-wide bus found on NVIDIA’s RTX 2080 Ti. Second, they rely on specialized high-performance memory chips. Consumer-grade accelerators typically use GDDR6 modules, offering over 500 GB/s of aggregate bandwidth. In contrast, high-end server accelerators, like the NVIDIA Volta V100, use High Bandwidth Memory (HBM). HBM modules connect directly to the GPU on a dedicated silicon wafer, delivering massive speed but significantly increasing manufacturing costs. Consequently, while GPU memory is functionally similar to CPU memory, it is substantially faster but generally much smaller in capacity.

High-Bandwidth GPU Memory Technologies

Hardware accelerators optimized for deep learning inference are designed specifically to compute the forward propagation of a neural network. Because no intermediate data needs to be stored for backpropagation, these devices require significantly less memory capacity. Furthermore, inference tasks can typically tolerate lower numerical precision without heavily impacting predictions, allowing these accelerators to efficiently utilize formats like FP16 or INT8. For example, NVIDIA's Turing T4 GPUs are specifically tailored for these streamlined inference workloads.

Hardware Accelerators for Inference

Hardware accelerators optimized for deep learning training must handle significantly higher computational and memory demands than inference devices. During training, all intermediate activations must be stored in memory to compute gradients during backpropagation. Additionally, accumulating gradients requires higher numerical precision—at minimum FP16 or mixed precision with FP32—to avoid issues like numerical underflow or overflow. Consequently, training accelerators (such as NVIDIA V100 GPUs) require vastly faster and larger memory technologies (e.g., HBM2 as opposed to GDDR6) and greater overall processing power.

Hardware Accelerators for Training

The NVIDIA Collective Communications Library (NCCL) is a software protocol specifically recommended for achieving highly efficient, optimized data transfers between multiple GPUs. By operating over high-speed hardware interconnects like NVLink and PCIe, NCCL streamlines multi-GPU communication to rapidly synchronize processing during deep learning model training.

Learn Before

Related

Learn After