Deep learning frameworks can automatically parallelize independent computations across multiple GPUs without requiring explicit multi-threading or scheduling code from the user. This automatic parallelism is a direct consequence of asynchronous execution: when the frontend issues operations targeting different GPU devices one after another, each operation is placed into the backend queue of its own device. Because no data dependency exists between operations on different devices, the backend processes them concurrently. However, if a synchronization barrier (such as `torch.cuda.synchronize()` or `npx.waitall()`) is inserted between the operations on the two devices, the frontend cannot enqueue the second device's work until the first device's work has completed, which serializes execution and prevents parallelism.
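
The effect of a barrier placed between two devices can be imitated without any GPU. The following is a minimal sketch, assuming that two single-worker thread pools stand in for the per-device backend queues and that `time.sleep` stands in for device computation; the names `gpu0`, `gpu1`, `kernel`, and `run` are hypothetical and not part of any framework.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Each single-worker pool plays the role of one device's backend queue.
gpu0 = ThreadPoolExecutor(max_workers=1)
gpu1 = ThreadPoolExecutor(max_workers=1)


def kernel(seconds=0.2):
    """Hypothetical workload; the sleep stands in for device computation."""
    time.sleep(seconds)
    return seconds


def run(barrier_between):
    """Issue one kernel per device and return the total elapsed time."""
    start = time.perf_counter()
    f0 = gpu0.submit(kernel)
    if barrier_between:
        f0.result()        # barrier: wait for device 0 before using device 1
    f1 = gpu1.submit(kernel)
    wait([f0, f1])         # final synchronization so the timing is meaningful
    return time.perf_counter() - start


t_parallel = run(barrier_between=False)
t_serial = run(barrier_between=True)
gpu0.shutdown()
gpu1.shutdown()
print(f"no barrier: {t_parallel:.2f}s, barrier between devices: {t_serial:.2f}s")
```

Without the barrier both kernels overlap, so the elapsed time is roughly that of one kernel; with the barrier it is roughly the sum of the two.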

Claude

By default, operations in deep learning frameworks are executed asynchronously in the backend. When a user issues a command via a frontend language (such as Python), the task is immediately placed into a backend queue, and the frontend instantly regains control without waiting for the computation to finish. This design allows the frontend thread to continue executing subsequent statements quickly, ensuring that the frontend language's performance overhead does not bottleneck the heavy computations being processed simultaneously on hardware accelerators like GPUs.
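
The enqueue-and-return behavior can be sketched with the Python standard library alone. This is a toy model, not a framework's actual backend: a single-worker thread pool plays the backend queue, and the names `backend`, `gate`, and `slow_matmul` are hypothetical. The `gate` event exists only to hold the task back so that its unfinished state can be observed reliably.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a framework backend: one worker thread draining a queue.
backend = ThreadPoolExecutor(max_workers=1)
gate = threading.Event()   # holds the "kernel" back until we release it


def slow_matmul(a, b):
    """Pretend kernel: waits for the gate, then multiplies two 2x2 matrices."""
    gate.wait()
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


x = [[1, 2], [3, 4]]
# submit() only enqueues the task and hands back a handle right away.
handle = backend.submit(slow_matmul, x, x)
done_right_after_enqueue = handle.done()   # False: the backend has not finished

gate.set()                 # let the backend run
result = handle.result()   # explicit wait, analogous to a synchronization point
backend.shutdown()
print(done_right_after_enqueue, result)
```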

Asynchronous Execution in Deep Learning Frameworks

In a data parallelism setup, the training process follows a specific sequence during each iteration. First, a random minibatch of training data is split into $$k$$ equal portions and distributed across the $$k$$ available GPUs. Each GPU then independently calculates the loss and the local gradients of the model parameters using its assigned data subset. These local gradients from all $$k$$ GPUs are subsequently aggregated (via an allreduce operation) to compute the overall minibatch stochastic gradient. This aggregated gradient is redistributed to every GPU, and each GPU uses it to independently update its complete set of model parameters, ensuring synchronization across the system.
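
The arithmetic behind one such iteration can be checked on a small example. The sketch below assumes a made-up scalar model $$\hat{y} = w x$$ with a summed squared loss and a made-up minibatch; plain Python lists stand in for GPUs, and `local_gradient` and `split` are hypothetical helpers. It illustrates that the sum of the per-shard gradients equals the full-minibatch gradient, so every replica applies the same update.

```python
def local_gradient(w, xs, ys):
    """Gradient of sum_i (w * x_i - y_i)^2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def split(seq, k):
    """Split seq into k equal contiguous shards (len(seq) must divide by k)."""
    n = len(seq) // k
    return [seq[i * n:(i + 1) * n] for i in range(k)]


k, lr = 2, 0.01
xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]   # made-up minibatch
replicas = [0.0] * k                  # every "GPU" holds its own copy of w

# 1) split, 2) local gradients, 3) allreduce (sum), 4) identical local updates
x_shards, y_shards = split(xs, k), split(ys, k)
grads = [local_gradient(w, xs_k, ys_k)
         for w, xs_k, ys_k in zip(replicas, x_shards, y_shards)]
total = sum(grads)
replicas = [w - lr * total / len(xs) for w in replicas]

full_batch = local_gradient(0.0, xs, ys)   # gradient on the unsplit minibatch
print(grads, total, full_batch, replicas)
```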

Data Parallelism Training Process

Dive into Deep Learning

In the MXNet framework, the command `npx.waitall()` acts as a global synchronization barrier. When invoked, it forces the Python frontend to halt execution and wait until every pending operation in the backend queue has completely finished, regardless of when those compute instructions were originally issued. While this ensures all results are available, using such a global barrier is generally discouraged unless absolutely necessary, as it severely disrupts asynchronous execution and can lead to poor overall system performance.

Global Synchronization in MXNet

Instead of halting all operations globally, MXNet allows for targeted synchronization by blocking execution only until a specific variable is computed. This is achieved by calling the `wait_to_read()` method on a specific tensor, such as `z.wait_to_read()`. In this scenario, the framework blocks the return of control to the Python frontend only until that particular variable's result is available, while permitting other unrelated background computations in the backend queue to continue processing simultaneously.
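
The contrast between waiting on one result and waiting on everything can be imitated with standard-library futures. This is only an analogy under stated assumptions: `Future.result()` plays the role of `z.wait_to_read()`, `concurrent.futures.wait` over all handles plays the role of `npx.waitall()`, and `compute_z`, `unrelated_work`, and `release_other` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

backend = ThreadPoolExecutor(max_workers=2)   # toy backend with two workers
release_other = threading.Event()


def compute_z():
    return 6 * 7


def unrelated_work():
    release_other.wait()   # stays busy until we release it
    return "done"


other = backend.submit(unrelated_work)
z = backend.submit(compute_z)

# Targeted wait (in the spirit of z.wait_to_read()): blocks only on z.
z_value = z.result()
other_still_running = not other.done()   # True: unrelated work was not awaited

# Global barrier (in the spirit of npx.waitall()): blocks on everything pending.
release_other.set()
wait([z, other])
all_done = z.done() and other.done()
backend.shutdown()
print(z_value, other_still_running, all_done)
```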

Variable-Specific Synchronization in MXNet

Beyond explicit synchronization commands, deep learning frameworks contain implicit blockers that force the frontend to wait for backend computations to complete. Any operation that requires direct access to a variable's underlying value acts as a blocker because the frontend cannot proceed until that specific value is fully computed and available. Common examples of implicit blockers include invoking the `print` function on a tensor, converting a tensor to a scalar value using methods like `item()`, or explicitly converting a tensor to a NumPy array via methods like `asnumpy()`. These operations implicitly stall the frontend because standard Python and libraries like NumPy have no built-in notion of asynchrony and require the final numerical result before they can proceed.

Implicit Blockers in Deep Learning Frameworks

In PyTorch, developers can explicitly force all pending computations on a GPU to complete before control returns to the frontend by using a synchronization barrier. Specifically, calling `torch.cuda.synchronize(device)` blocks the Python frontend thread until every operation queued on the designated GPU device has finished executing. This synchronization is essential for tasks such as precise performance benchmarking; without it, measured execution times would reflect only the negligible delay of adding tasks to the backend queue rather than the true computational duration.

Global Synchronization in PyTorch

To demonstrate the effects of asynchronous execution, consider a warmup toy problem that generates a random $$1000 \times 1000$$ matrix and multiplies it by itself. When this matrix multiplication is benchmarked in a deep learning framework like PyTorch or MXNet against NumPy, the framework appears to be orders of magnitude faster. While GPU execution provides a significant speedup, the massive time difference arises primarily because the framework's operations are asynchronous: the backend executes the computation while the frontend immediately returns control to Python. Accurate benchmarking requires forcing the framework to finish all backend computations before the elapsed time is recorded, which reveals the true execution duration.
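
The timing pitfall can be reproduced with a toy backend. The sketch below assumes a single-worker thread pool in place of the framework's backend and a `time.sleep` call in place of the matrix multiplication; `matmul_kernel` is a hypothetical name. In a real framework, the wait on the result would be replaced by the framework's own synchronization call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

backend = ThreadPoolExecutor(max_workers=1)   # toy asynchronous backend


def matmul_kernel():
    time.sleep(0.3)        # stand-in for the real matrix multiplication
    return "product"


# Naive timing: stops the clock as soon as the task is enqueued.
start = time.perf_counter()
pending = backend.submit(matmul_kernel)
naive = time.perf_counter() - start
pending.result()           # drain the queue before the next measurement

# Correct timing: waits for the result before stopping the clock.
start = time.perf_counter()
backend.submit(matmul_kernel).result()
synced = time.perf_counter() - start

backend.shutdown()
print(f"naive: {naive:.4f}s, synchronized: {synced:.4f}s")
```

The naive measurement captures only the cost of enqueueing, while the synchronized one includes the full duration of the simulated kernel.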

Example of Asynchronous Benchmarking

On heavily multithreaded systems—ranging from standard laptops with $$4$$ or more threads to multi-socket servers exceeding $$256$$ threads—the overhead of scheduling computational operations can become a significant performance bottleneck. Each operation dispatched to the backend must be placed in a queue, prioritized, and routed to an available thread, and this bookkeeping cost grows with system concurrency. To mitigate this overhead, it is highly desirable for computation and scheduling to proceed asynchronously and in parallel, so that the frontend can rapidly enqueue work while the backend processes it concurrently, rather than serializing every operation through a synchronous round-trip.

Scheduling Overhead in Multithreaded Deep Learning Systems

While asynchronous execution keeps the Python frontend highly responsive by allowing it to continuously enqueue operations without waiting, this responsiveness introduces a risk: if the frontend submits work faster than the backend can process it, the task queue grows unboundedly, leading to excessive memory consumption. To prevent such overflow, it is recommended to insert a synchronization barrier after each minibatch during training. This per-minibatch synchronization forces the frontend to pause briefly while the backend catches up, keeping the two approximately in step and bounding the queue's memory footprint without sacrificing the major throughput advantages of asynchronous execution.
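
How a per-minibatch barrier bounds the backlog can be seen in a small simulation. This is a sketch under simplifying assumptions: a single-worker thread pool stands in for the backend, a closed `threading.Event` models a backend that cannot keep up with the frontend, and `peak_queue_depth` is a hypothetical helper that counts operations enqueued since the last barrier.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def peak_queue_depth(num_batches, ops_per_batch, sync_each_batch):
    """Return the largest backlog of unfinished operations seen by the frontend."""
    backend = ThreadPoolExecutor(max_workers=1)
    gate = threading.Event()   # closed gate: no enqueued operation can finish
    pending, peak = [], 0
    for _ in range(num_batches):
        for _ in range(ops_per_batch):
            pending.append(backend.submit(gate.wait))   # enqueue, do not wait
        peak = max(peak, len(pending))   # nothing has finished while gate is closed
        if sync_each_batch:
            gate.set()         # the backend catches up ...
            wait(pending)      # ... while the frontend blocks at the barrier
            pending.clear()
            gate.clear()
    gate.set()                 # drain whatever is left before shutting down
    wait(pending)
    backend.shutdown()
    return peak


peak_with_sync = peak_queue_depth(5, 4, sync_each_batch=True)
peak_without_sync = peak_queue_depth(5, 4, sync_each_batch=False)
print(peak_with_sync, peak_without_sync)
```

With the barrier, the backlog never exceeds one minibatch's worth of operations; without it, the backlog grows with the number of minibatches issued.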

Minibatch Synchronization to Prevent Task Queue Overflow

Hardware chip manufacturers provide sophisticated performance analysis and profiling tools designed to give deep learning practitioners fine-grained insight into the computational efficiency of their models. These vendor-supplied utilities go beyond simple timing measurements, enabling detailed examination of how operations are scheduled, how hardware resources are utilized, and where bottlenecks occur during training and inference on specialized accelerators.

Chip Vendor Performance Analysis Tools for Deep Learning

Automatic Multi-GPU Parallelism via Asynchronous Execution

The `train_batch` function implements data-parallel training for a single minibatch across multiple GPUs. The procedure follows four sequential stages:

1. **Data splitting:** The minibatch of features and labels is divided across the available devices using `split_batch`.
2. **Per-GPU forward pass and loss:** Each GPU independently computes the model's output and the loss on its local data shard. The losses are summed per device.
3. **Per-GPU backpropagation:** Backpropagation is performed separately on each GPU to compute local gradients.
4. **Gradient synchronization and update:** An `allreduce` operation sums and broadcasts all gradients across GPUs within a `torch.no_grad()` context. Finally, each GPU independently updates its own copy of the model parameters using SGD, scaling the update by the full (unsplit) batch size.

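```python
import torch
from d2l import torch as d2l

# Assumes `split_batch`, `loss`, `lenet`, and `allreduce` are defined elsewhere.
def train_batch(X, y, device_params, devices, lr):
    # Stage 1: split the minibatch across the devices
    X_shards, y_shards = split_batch(X, y, devices)
    # Stage 2: forward pass and summed loss, computed separately on each GPU
    ls = [loss(lenet(X_shard, device_W), y_shard).sum()
          for X_shard, y_shard, device_W in zip(
              X_shards, y_shards, device_params)]
    # Stage 3: backpropagation, performed separately on each GPU
    for l in ls:
        l.backward()
    # Stage 4a: sum each parameter's gradients and broadcast them to all GPUs
    with torch.no_grad():
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad
                       for c in range(len(devices))])
    # Stage 4b: update each GPU's parameter copy, scaled by the full batch size
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0])
```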

Because the forward and backward passes on different GPUs share no dependencies within a single minibatch, the per-GPU computations execute in parallel automatically; only the `allreduce` step couples the devices.

Multi-GPU Minibatch Training Implementation

Distributed parallel training across multiple machines involves a synchronized sequence of operations to manage parameters over a comparatively lower-bandwidth network fabric. The process follows seven steps:

1. A batch of data is read on each machine, split, and transferred to the local GPUs, where predictions and gradients are computed separately.
2. The local gradients are aggregated (or partially aggregated) on the GPUs within each machine.
3. The aggregated gradients are sent from the GPUs to the machine's CPUs.
4. The CPUs transmit the gradients over the network to a central parameter server, which aggregates the gradients from all machines.
5. The central server uses the aggregated gradients to update the parameters and broadcasts them back to the individual CPUs.
6. The CPUs send the updated parameters to one or more local GPUs.
7. The updated parameters are spread across all GPUs on the machine for the next iteration.
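
The data flow of the seven steps above can be traced with plain numbers. The sketch below is a toy under stated assumptions: a single scalar parameter, gradients that are already computed (step 1), and nested Python lists in place of machines and GPUs; `sgd_round` is a hypothetical helper, not a parameter-server API.

```python
def sgd_round(param, grads_per_machine, lr):
    """One round; grads_per_machine[m][g] is the gradient on GPU g of machine m."""
    # Steps 2-3: aggregate across local GPUs and hand the sum to the machine's CPU.
    cpu_grads = [sum(gpu_grads) for gpu_grads in grads_per_machine]
    # Step 4: the parameter server aggregates across machines.
    server_grad = sum(cpu_grads)
    # Step 5: the server updates the parameter and broadcasts it to the CPUs.
    new_param = param - lr * server_grad
    # Steps 6-7: every GPU on every machine receives the same updated value.
    return [[new_param for _ in gpu_grads] for gpu_grads in grads_per_machine]


# Two machines with two GPUs each; the gradient values are made up.
grads = [[1.0, 2.0], [3.0, 4.0]]
replicas = sgd_round(param=1.0, grads_per_machine=grads, lr=0.1)
print(replicas)
```

Every replica ends the round with the same parameter value, which is what keeps the machines consistent for the next iteration.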

Multi-Machine Distributed Parallel Training Process

To train the image classification model and select optimal hyperparameters based on validation set performance, a comprehensive training function is defined. This function orchestrates the training loop over a specified number of epochs using data parallelism across multiple devices. It initializes an optimization algorithm, such as stochastic gradient descent (SGD) with momentum and weight decay, and incorporates a step learning rate scheduler to periodically decay the learning rate by a specified factor. Within each epoch, the function iterates through training minibatches to update parameters and accumulate training loss and accuracy. If a validation iterator is provided, it evaluates the model's accuracy on the held-out validation set, plotting these metrics dynamically to monitor the model's generalization progress.
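
The step schedule mentioned above reduces to a one-line formula: the learning rate is multiplied by the decay factor once every fixed number of epochs. The sketch below is a plain-Python illustration of that rule, not the training function itself; the names `step_lr`, `lr_period`, and `lr_decay` and the example values are assumptions.

```python
import math


def step_lr(base_lr, lr_period, lr_decay, epoch):
    """Learning rate in effect after `epoch` completed epochs under step decay."""
    return base_lr * lr_decay ** (epoch // lr_period)


# Halve the learning rate every 2 epochs, starting from 0.1.
schedule = [step_lr(0.1, 2, 0.5, epoch) for epoch in range(6)]
expected = [0.1, 0.1, 0.05, 0.05, 0.025, 0.025]
matches = all(math.isclose(a, b) for a, b in zip(schedule, expected))
print(schedule, matches)
```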
