The `train_batch` function implements data-parallel training for a single minibatch across multiple GPUs. The procedure follows four sequential stages:

1. **Data splitting:** The minibatch of features and labels is divided across the available devices using `split_batch`.
2. **Per-GPU forward pass and loss:** Each GPU independently computes the model's output and the loss on its local data shard. The losses are summed per device.
3. **Per-GPU backpropagation:** Backpropagation is performed separately on each GPU to compute local gradients.
4. **Gradient synchronization and update:** Within a `torch.no_grad()` context, an `allreduce` operation sums the gradients across GPUs and broadcasts the result back to every GPU. Each GPU then independently updates its own copy of the model parameters using SGD, dividing the summed gradient by the full (unsplit) batch size.

```python
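import torch
from d2l import torch as d2l

# Assumes `split_batch`, `loss`, `lenet`, and `allreduce` are defined elsewhere.
def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    # Compute the summed loss separately on each GPU
    ls = [loss(lenet(X_shard, device_W), y_shard).sum()
          for X_shard, y_shard, device_W in zip(
              X_shards, y_shards, device_params)]
    for l in ls:  # Backpropagate separately on each GPU
        l.backward()
    # Sum the gradients from all GPUs and broadcast the result to each GPU
    with torch.no_grad():
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad
                       for c in range(len(devices))])
    # Update each GPU's copy of the parameters, using the full batch size
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0])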
```

Because there are no cross-device dependencies within the computational graph for a single minibatch, the per-GPU computations execute in parallel automatically.

Multi-GPU Minibatch Training Implementation

Deep learning frameworks can automatically parallelize independent computations across multiple GPUs without requiring explicit multi-threading or scheduling code from the user. This automatic parallelism is a direct consequence of asynchronous execution: when the frontend issues operations targeting different GPU devices one after another, the operations are placed into a separate backend queue for each device. Because no data dependency exists between operations on different devices, the backend processes them concurrently. However, if a synchronization barrier (such as `torch.cuda.synchronize()` or `npx.waitall()`) is inserted between the operations on the two devices, the first device's work must complete before the second device's work begins, which serializes execution and prevents parallelism.

Automatic Multi-GPU Parallelism via Asynchronous Execution

Distributed parallel training across multiple machines involves a synchronized sequence of operations to manage parameters over a comparatively lower bandwidth network fabric. The process follows seven steps:

1. A batch of data is read on each machine, split, and transferred to local GPUs, where predictions and gradients are computed separately.
2. The local gradients are aggregated (or partially aggregated) on the GPUs within each machine.
3. These aggregated gradients are sent from the GPUs to the machine's CPUs.
4. The CPUs transmit the gradients over the network to a central parameter server, which aggregates all gradients across the different machines.
5. The central server uses the aggregate gradients to update the parameters and broadcasts them back to the individual CPUs.
6. The CPUs send the updated parameters to one or more local GPUs.
7. The parameters are spread across all GPUs on the machine for the next iteration.
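The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: plain Python lists stand in for GPU tensors, the per-GPU gradients of step 1 are taken as given, and the names `elementwise_sum` and `parameter_server_step` are hypothetical helpers, not part of any library.

```python
# Hypothetical simulation of one multi-machine training step.
# Plain lists stand in for tensors; no real devices or network are involved.

def elementwise_sum(vectors):
    """Sum a list of equal-length gradient vectors element by element."""
    return [sum(values) for values in zip(*vectors)]


def parameter_server_step(params, grads_per_machine, lr):
    """Aggregate gradients hierarchically, update once, and broadcast."""
    # Steps 2-3: each machine aggregates its local GPU gradients and
    # hands the result to its CPU.
    machine_grads = [elementwise_sum(gpu_grads)
                     for gpu_grads in grads_per_machine]
    # Step 4: the central server aggregates across machines.
    total_grad = elementwise_sum(machine_grads)
    # Step 5: the server updates the parameters.
    new_params = [p - lr * g for p, g in zip(params, total_grad)]
    # Steps 5-7: the updated parameters are copied back to every GPU
    # on every machine.
    return [[list(new_params) for _ in gpu_grads]
            for gpu_grads in grads_per_machine]


params = [1.0, 2.0]
grads_per_machine = [
    [[0.5, 1.0], [1.5, 1.0]],  # machine 0, two GPUs
    [[1.0, 2.0], [1.0, 0.0]],  # machine 1, two GPUs
]
replicas = parameter_server_step(params, grads_per_machine, lr=0.5)
```

After the call, every entry of `replicas` holds the same parameter vector, which is the property the final broadcast steps are meant to guarantee.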

Multi-Machine Distributed Parallel Training Process

To train the image classification model and select optimal hyperparameters based on validation set performance, a comprehensive training function is defined. This function orchestrates the training loop over a specified number of epochs using data parallelism across multiple devices. It initializes an optimization algorithm, such as stochastic gradient descent (SGD) with momentum and weight decay, and incorporates a step learning rate scheduler to periodically decay the learning rate by a specified factor. Within each epoch, the function iterates through training mini-batches to update parameters and accumulate training loss and accuracy. If a validation iterator is provided, it evaluates the model's accuracy on the hold-out validation set, plotting these metrics dynamically to monitor the model's generalization progress.
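The step learning rate schedule mentioned above can be written in closed form. The sketch below is a minimal illustration; the names `step_lr`, `lr_period`, and `lr_decay` are assumptions chosen for this example. In PyTorch the same behavior is provided by `torch.optim.lr_scheduler.StepLR`, whose `step_size` and `gamma` arguments play the roles of the period and the decay factor.

```python
def step_lr(base_lr, epoch, lr_period, lr_decay):
    """Learning rate at `epoch` when the rate is multiplied by
    `lr_decay` once every `lr_period` epochs."""
    return base_lr * lr_decay ** (epoch // lr_period)


# Illustrative values only: halve the rate every 4 epochs.
schedule = [step_lr(0.1, epoch, lr_period=4, lr_decay=0.5)
            for epoch in range(10)]
```

With these illustrative values the rate stays constant for four epochs at a time and is halved at epochs 4 and 8.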

CIFAR-10 Model Training Function Definition

In a data parallelism setup, the training process follows a specific sequence during each iteration. First, a random minibatch of training data is split into $$k$$ equal portions and distributed across the $$k$$ available GPUs. Each GPU then independently calculates the loss and the local gradients of the model parameters using its assigned data subset. These local gradients from all $$k$$ GPUs are subsequently aggregated (via an allreduce operation) to compute the overall minibatch stochastic gradient. This aggregated gradient is redistributed to every GPU, and each GPU uses it to independently update its complete set of model parameters, ensuring synchronization across the system.
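One iteration of this sequence can be sketched with a toy scalar model. Everything here is an assumption made for illustration: the model is $$y = w x$$ with a summed squared-error loss, plain floats stand in for per-GPU parameter copies, and `local_gradient` and `data_parallel_step` are hypothetical names.

```python
def local_gradient(w, shard):
    """Gradient w.r.t. w of the summed loss 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in shard)


def data_parallel_step(replica_weights, minibatch, lr):
    """Run one data-parallel update and return the new per-replica weights."""
    k = len(replica_weights)
    shard_size = len(minibatch) // k
    # Split the minibatch into k equal portions, one per replica.
    shards = [minibatch[i * shard_size:(i + 1) * shard_size]
              for i in range(k)]
    # Each replica computes a gradient on its own shard only.
    local_grads = [local_gradient(w, shard)
                   for w, shard in zip(replica_weights, shards)]
    # Aggregation: every replica receives the same summed gradient.
    total_grad = sum(local_grads)
    # Each replica applies the identical update, normalized by the
    # full minibatch size, so the copies stay synchronized.
    return [w - lr * total_grad / len(minibatch) for w in replica_weights]


minibatch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
weights = data_parallel_step([0.0, 0.0], minibatch, lr=0.1)
```

Because all replicas start from the same weight and apply the same aggregated gradient, they end the step with identical weights.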


Data parallelism is a widely used and highly convenient strategy for distributing deep learning training across multiple GPUs. In this approach, every GPU maintains a complete replica of the model and performs the identical sequence of operations, but each processes a different subset of the training minibatch. After each minibatch, the independently computed gradients are aggregated across all GPUs to synchronize and update the model parameters. To maximize efficiency, it is highly desirable to overlap computation and communication by exchanging gradients for some parameters while others are still being computed. While data parallelism enables larger effective minibatch sizes and increases overall training throughput, it is ultimately constrained by the memory of a single GPU and does not facilitate the training of larger models.

Data Parallelism

Dive into Deep Learning

The standard delta rule for gradient descent updates a model's parameters by taking a small step in the direction of the negative loss gradient. The new parameters $$\theta_{t+1}$$ are obtained according to the formula: $$\theta_{t+1} = \theta_t - lr \cdot \frac{\partial L_{\theta_t}(\mathcal{D}_{\mathrm{mini}})}{\partial \theta_t}$$. In this equation, $$\theta_t$$ represents the latest parameters, $$lr$$ is the step size (learning rate), and the fractional term is the gradient of the loss function $$L$$ with respect to $$\theta_t$$, computed on a minibatch of training samples $$\mathcal{D}_{\mathrm{mini}}$$.

Gradient Descent Update Rule

In data parallelism, a minibatch of training samples, $$\mathcal{D}_{\mathrm{mini}}$$, is divided into $$N$$ smaller batches, denoted $$\mathcal{D}^{1},\dots,\mathcal{D}^{N}$$. These smaller batches are then distributed to $$N$$ separate workers, one batch per worker, so that the workers can process them at the same time.

Set of Distributed Data Batches in Data Parallelism

Under optimal conditions, data parallelism can significantly accelerate the training process. When worker coordination is efficient and communication overhead is negligible, the training speed can increase by a factor of nearly $$N$$, where $$N$$ is the number of workers. This represents a near-linear speed-up.

Ideal Speed-up in Data Parallelism

A team is training a neural network using a technique where a large batch of data is split equally among 8 machines. Each machine has a full, identical copy of the network model. During a training step, each machine processes its portion of the data and calculates a set of proposed parameter updates. Given this setup, what is the most critical subsequent action to ensure the entire system learns effectively from the full batch of data?

In distributed training using data parallelism, the gradient of the loss function, $$L$$, with respect to the parameters, $$\theta_t$$, for a complete mini-batch, $$\mathcal{D}_{\mathrm{mini}}$$, is computed by summing the gradients from multiple workers. Each worker calculates the gradient on a separate partition of the mini-batch, denoted as $$\mathcal{D}^1, \mathcal{D}^2, \dots, \mathcal{D}^N$$. This aggregation of gradients is represented by the formula:

$$\frac{\partial L_{\theta_t}(\mathcal{D}_{\mathrm{mini}})}{\partial \theta_t} = \underbrace{\frac{\partial L_{\theta_t}(\mathcal{D}^1)}{\partial \theta_t}}_{\textrm{worker 1}} + \underbrace{\frac{\partial L_{\theta_t}(\mathcal{D}^2)}{\partial \theta_t}}_{\textrm{worker 2}} + \cdots + \underbrace{\frac{\partial L_{\theta_t}(\mathcal{D}^N)}{\partial \theta_t}}_{\textrm{worker } N}$$

This allows for parallel computation, significantly speeding up the training process for large models.
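The decomposition can be checked numerically. The sketch below assumes a linear model with a squared-error loss that is summed (not averaged) over samples, which is what makes the per-worker gradients add up exactly; with an averaged loss each worker's gradient would need to be weighted by its share of the samples. The function name `gradient` and the data are made up for illustration.

```python
def gradient(theta, batch):
    """Gradient of the summed loss 0.5 * (theta . x - y) ** 2 over `batch`."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    return grad


theta = [0.5, -1.0]
mini_batch = [([1.0, 2.0], 1.0), ([0.0, 1.0], 2.0),
              ([3.0, 1.0], 0.0), ([2.0, 2.0], 1.0)]

# Partition the mini-batch across N = 2 hypothetical workers.
num_workers = 2
partitions = [mini_batch[i::num_workers] for i in range(num_workers)]

# Each worker computes the gradient on its own partition.
worker_grads = [gradient(theta, part) for part in partitions]

# Summing the worker gradients recovers the full mini-batch gradient.
summed = [sum(parts) for parts in zip(*worker_grads)]
full = gradient(theta, mini_batch)
```

Here `summed` and `full` agree up to floating-point rounding, matching the formula term by term.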

Distributed Gradient Calculation

A single training step is performed using a technique where a mini-batch of data is processed in parallel across multiple machines. Each machine holds a complete copy of the model. Arrange the following events in the correct chronological order for one such training step.

A machine learning team is training a large neural network on a massive dataset. To accelerate the process, they employ a strategy where the training data is split across 16 GPUs. Each GPU holds a complete copy of the model and processes its own subset of the data. After each forward and backward pass, the results from all GPUs are combined before updating the model's parameters. The team observes that while using 8 GPUs provided a nearly 8x speed-up compared to a single GPU, scaling to 16 GPUs 

A research team is planning to train a very large neural network. They have access to a cluster of 8 powerful machines, but they discover that the model's full set of parameters is too large to fit into the memory of any single machine. A junior member of the team suggests using a training method where the data is split among the machines to speed up the process. Evaluate this suggestion. Is this a viable strategy for their specific problem? Justify your conclusion based on the fundamental requirements of this training method.

Evaluating a Training Strategy

Your team must train a 30B-parameter LLM on a sing...

You are on-call for an internal LLM training platf...

Your team is training a 70B-parameter LLM on 8 GPU...

You’re advising an internal platform team that mus...

You are the tech lead for training a new LLM that cannot fit on a single GPU due to parameter/activation memory, but leadership also expects near-linear throughput scaling when moving from 8 to 32 GPUs. Your cluster has 32 identical GPUs connected with high-bandwidth intra-node links and slower inter-node links. You must choose a distributed training approach that combines (as needed) data parallelism, model parallelism, pipeline parallelism (with micro-batching), and mixed precision training.

Write a recommendation memo that proposes a concrete parallelism/mixed-precision strategy and justifies it. Your memo must: (1) explain how your design resolves the single-GPU out-of-memory issue, (2) explain where and why gradient synchronization/communication happens and how it affects scaling, (3) explain how pipeline micro-batching changes device utilization compared with naive layer-splitting, and (4) explain how mixed precision improves speed/memory while still keeping training numerically stable (e.g., what stays in higher precision and why). Conclude by identifying the most likely bottleneck that will prevent perfect 32x scaling in your design and one mitigation you would try first.

Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints

You are the on-call ML engineer for a corporate LLM fine-tuning job running on 8 GPUs (each 40 GB). The model is too large to fit on a single GPU in full precision, so the team split the model across 4 GPUs in sequential stages (a pipeline). To increase throughput, they also run 2 identical pipeline replicas (so all 8 GPUs are used) and split each global mini-batch across the 2 replicas. They enabled mixed precision so most compute uses FP16, while a master copy of weights is kept in FP32 for updates.

After several hours, the run shows two problems: (1) training loss becomes unstable and occasionally spikes to NaN; (2) GPU utilization is uneven—some GPUs are frequently idle even though the input pipeline is not the bottleneck.

Write a postmortem-style response that (a) identifies the most plausible root causes that connect the chosen parallelism strategy (data parallel across replicas + model/pipeline parallel within a replica) with mixed precision behavior, and (b) proposes a concrete redesign of the training step to address BOTH numerical stability and utilization. Your answer must explicitly explain the interactions/tradeoffs among gradient aggregation across replicas, micro-batching in the pipeline, and FP16/FP32 precision choices (e.g., where precision should be used and why), and justify why your redesign would reduce NaNs while improving end-to-end throughput.

Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization

You are the on-call ML engineer for a team training a 30B-parameter LLM on a 64-GPU cluster (8 nodes × 8 GPUs). The model does not fit on a single GPU, so the team shards the model across 4 GPUs per replica (model parallelism) and uses pipeline parallelism with micro-batches to keep those 4 GPUs busy. They then replicate this 4-GPU pipeline across the remaining GPUs using data parallelism, synchronizing gradients across replicas each step. To reduce memory and increase throughput, they enable mixed precision (FP16 compute with FP32 master weights).

After a change request to “increase throughput,” the team doubles the number of data-parallel replicas (more pipelines in parallel) and also increases the number of micro-batches per step. Throughput improves, but two problems appear: (1) scaling efficiency drops sharply (adding replicas yields little additional speed), and (2) training becomes less stable (loss occasionally spikes or diverges).

Write an analysis that identifies the most likely root causes of BOTH problems and proposes a concrete mitigation plan. Your answer must explicitly connect how data parallel gradient synchronization, pipeline micro-batching, model sharding, and mixed precision interact (e.g., communication volume/frequency, pipeline bubbles/latency hiding, effective batch size and update frequency, and numerical stability during gradient aggregation). Conclude by recommending one revised configuration (at a high level) and justify the tradeoffs you are making.

Diagnosing a Scaling Regression in Hybrid Parallel LLM Training

You are the on-call ML platform lead for a company training a 30B-parameter transformer. You have access to two clusters:

- Cluster A: 8 GPUs/node, 80 GB VRAM each, fast NVLink within node, 200 Gbps inter-node network.
- Cluster B: 8 GPUs/node, 40 GB VRAM each, slower interconnect within node, 100 Gbps inter-node network.

The team’s current setup uses pure data parallelism with FP16 mixed precision (FP16 compute, FP32 master weights). On Cluster A it trains stably but is slower than expected; on Cluster B it frequently hits out-of-memory errors unless the global batch size is reduced so much that throughput collapses.

You are asked to propose ONE distributed training configuration that can run on both clusters with minimal code divergence. Your proposal must specify how you will combine (a) data parallelism, (b) model parallelism, (c) pipeline parallelism (including whether you will use micro-batches), and (d) mixed precision choices, and it must justify the key tradeoffs you are making between memory fit, communication overhead, device utilization, and numerical stability.

What configuration do you recommend, and why is it the best compromise given the two clusters’ constraints?

Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters

You are the tech lead for an internal LLM training platform. Your team is moving a 30B-parameter transformer pretraining job to a new cluster: 32 GPUs total, each with 24 GB VRAM, connected with high-bandwidth interconnect. The model does not fit on a single GPU in FP32, and the team wants to maximize tokens/sec while keeping training stable.

Two candidate configurations are proposed:

A) Pure data parallelism across all 32 GPUs, using mixed precision (FP16 compute with FP32 master weights) and gradient all-reduce every step.

B) A hybrid approach: split the model into 4 sequential pipeline stages (pipeline parallelism) with micro-batches, use model parallelism within each stage across 2 GPUs to fit the largest layers, and then use data parallelism across the remaining replicas; also use mixed precision (FP16 compute with FP32 master weights).

During a pilot run, the team observes:
- With A, the job cannot start due to out-of-memory errors even at small batch sizes.
- With B, the job starts, but tokens/sec is lower than expected and some GPUs show periodic idle gaps.

As the decision-maker, which configuration should you choose and what specific adjustment(s) would you make to address the observed issues while preserving numerical stability? In your answer, explicitly connect (1) why the memory constraint rules out or enables a strategy, (2) how data/model/pipeline parallelism interact to affect utilization and communication, and (3) how mixed precision changes both memory headroom and stability requirements.

Choosing a Distributed Training Configuration After a Hardware Refresh

You are the on-call ML engineer for a corporate LLM fine-tuning job that must finish within a weekend. The model is 30B parameters and does not fit in the memory of a single 80GB GPU in full precision. You have access to 8 identical 80GB GPUs connected with high-bandwidth interconnect. The team’s first attempt used pure data parallelism (one full model replica per GPU) and failed with out-of-memory errors. A second attempt split the model across 4 GPUs (layer-wise model parallelism) and ran, but GPU utilization was low because only one stage was busy at a time; throughput was far below target. A third attempt enabled mixed precision (FP16 compute) and ran faster, but training became numerically unstable unless the learning rate was reduced so much that the weekend deadline was missed.

As the incident owner, propose a single revised distributed training design that uses (a) an appropriate combination of data parallelism, model parallelism, and pipeline parallelism, and (b) mixed precision in a way that improves both memory feasibility and throughput while maintaining numerical stability. Your answer must explicitly justify: (1) how your design resolves the original OOM issue, (2) how it avoids the low-utilization problem seen in the sequential model split, and (3) what specific mixed-precision practice(s) you would use to reduce instability while keeping most of the speed/memory benefits.

Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run

When distributing training across $$k$$ GPUs using data parallelism, it is standard practice to scale the overall minibatch size by a factor of $$k$$. This ensures that each individual GPU processes the same amount of data, and performs an equivalent computational workload, as it would if training on a single GPU. Because this scaling considerably increases the effective minibatch size (for example, a 16-fold increase on a 16-GPU server), the learning rate typically needs to be increased accordingly to keep optimization stable and efficient.
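The arithmetic can be made concrete with a short sketch. It assumes a strictly linear scaling of the learning rate, which is one common heuristic and not a guarantee; in practice the scaled rate may need tuning or warm-up. The function name and the numbers are hypothetical.

```python
def scale_for_data_parallel(per_gpu_batch_size, base_lr, num_gpus):
    """Return (global minibatch size, linearly scaled learning rate).

    Linear learning-rate scaling is an assumption of this sketch.
    """
    global_batch_size = per_gpu_batch_size * num_gpus
    scaled_lr = base_lr * num_gpus
    return global_batch_size, scaled_lr


# Illustrative values for a 16-GPU server.
global_batch, lr = scale_for_data_parallel(
    per_gpu_batch_size=256, base_lr=0.1, num_gpus=16)
```

Each GPU still processes 256 samples per step, while the effective minibatch grows to 16 times that size.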

Minibatch Scaling in Data Parallelism

Applying batch normalization during multi-GPU data parallelism requires specific architectural adjustments. Because the global minibatch is distributed across multiple devices, computing the exact normalization statistics across the entire batch would necessitate costly cross-device synchronization. A practical solution is to maintain a separate batch normalization coefficient for each GPU, allowing each device to calculate its own mean and variance statistics locally based solely on its assigned subset of the minibatch data.
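The difference between per-GPU and whole-batch statistics can be seen in a small sketch. It uses a one-dimensional toy minibatch split across two hypothetical devices; `batch_stats` is an illustrative helper, not a framework API.

```python
from statistics import fmean, pvariance


def batch_stats(values):
    """Mean and (population) variance, as used for batch normalization."""
    return fmean(values), pvariance(values)


minibatch = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 14.0, 16.0]
shards = [minibatch[:4], minibatch[4:]]  # one shard per device

# Each device normalizes with statistics from its own shard only,
# so no cross-device communication is needed.
local_stats = [batch_stats(shard) for shard in shards]

# Exact statistics over the whole minibatch would need every shard.
global_stats = batch_stats(minibatch)
```

The local means and variances differ from the global ones, which is the approximation that per-GPU batch normalization accepts in exchange for avoiding synchronization.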

Batch Normalization in Data Parallelism

Data Parallelism Training Process

Efficient multi-GPU training relies on two foundational data synchronization operations. First, parameters must be distributed to multiple devices and gradients must be attached, because without parameters it is impossible to evaluate the network on a GPU. Second, an allreduce function is required to sum values (during training, the gradients) across multiple devices and broadcast the result back, ensuring consistency.
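Both operations can be sketched without any GPU. In this illustration Python lists stand in for per-device tensors, and `replicate` and `allreduce` are simplified stand-ins written for this example; they are not the implementations used by any framework.

```python
def replicate(params, num_devices):
    """Give every device its own independent copy of the parameters."""
    return [list(params) for _ in range(num_devices)]


def allreduce(buffers):
    """Sum equal-length buffers element-wise, then write the total
    back into every buffer in place."""
    total = [sum(values) for values in zip(*buffers)]
    for buf in buffers:
        buf[:] = total


device_params = replicate([0.1, 0.2], num_devices=3)

# One gradient buffer per device before synchronization.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
allreduce(grads)
```

After `allreduce`, every device holds the same summed gradient, so applying the same update rule on each device keeps the parameter copies consistent.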

Data Synchronization in Multi-GPU Training

The effectiveness of multi-GPU data parallelism depends critically on the ratio of computation time to synchronization overhead. When a model is computationally lightweight (e.g., LeNet), the time spent on the forward pass and gradient computation is comparable to or smaller than the time required for cross-device parameter synchronization and Python scheduling overhead. In such cases, adding more GPUs yields no meaningful speedup. Conversely, when a model is sufficiently complex (e.g., ResNet-18), the per-device computation time dominates the synchronization cost, making the parallelization overhead relatively negligible and enabling significant scalability improvements as more GPUs are added.
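A back-of-the-envelope cost model illustrates this ratio. The model is an assumption of this sketch: computation is taken to divide evenly across GPUs, and synchronization is a fixed per-step cost whenever more than one GPU is used. The time values are made-up units, not measurements of LeNet or ResNet-18.

```python
def estimated_speedup(compute_time, sync_time, num_gpus):
    """Speedup over a single GPU under a simple cost model:
    step time = compute_time / num_gpus + sync_time (for num_gpus > 1)."""
    if num_gpus == 1:
        return 1.0
    return compute_time / (compute_time / num_gpus + sync_time)


# Lightweight model: synchronization costs as much as the computation.
light = estimated_speedup(compute_time=1.0, sync_time=1.0, num_gpus=2)

# Heavier model: computation dominates synchronization.
heavy = estimated_speedup(compute_time=100.0, sync_time=1.0, num_gpus=2)
```

Under these assumed costs the lightweight case ends up slower on two GPUs than on one, while the compute-heavy case approaches the ideal factor of two.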
