To counteract the inefficiencies inherent in model parallelism, such as worker idle time, practical implementations often combine it with other parallelism techniques. This hybrid approach aims to maximize the utilization of computing devices and improve overall training throughput.

Combining Model Parallelism with Other Mechanisms

Pipeline parallelism is a strategy designed to overcome the inefficiency of basic model parallelism, where hardware is underutilized because only one device is active at any given moment. This technique introduces computational overlap by dividing a data batch into smaller units called micro-batches. These micro-batches are fed into a pipeline of workers, allowing a worker to begin processing the next micro-batch as soon as it has passed the current one to the subsequent worker. This creates a continuous flow of computation, ensuring that different devices are working simultaneously on different stages of the process.

Pipeline Parallelism

A research team is training a neural network that is too large to fit into the memory of a single processing unit. To overcome this limitation, they decide to split the network's layers, placing the first set of layers on the first unit, the next set on the second unit, and so on, with the data flowing through them in sequence. Which statement best analyzes how this strategy addresses the memory constraint?

A deep learning team is tasked with training a new language model with 200 billion parameters. They have a cluster of GPUs, but each individual GPU has only 40GB of memory, which is not enough to store the entire model. The team proposes two potential training setups. Evaluate which setup is appropriate for this scenario and justify your reasoning by explaining why the chosen setup works and the other one fails.

Choosing a Parallelism Strategy for a Large Model

A machine learning team is faced with a neural network so large that its parameters cannot be stored in the memory of a single accelerator. Instead of replicating the entire network on multiple devices, they decide to partition the network itself, placing different parts on different accelerators. Explain the primary reason this partitioning approach is necessary and what fundamental resource limitation it directly addresses.

Rationale for Model Partitioning

Your team must train a 30B-parameter LLM on a sing...

You are on-call for an internal LLM training platf...

Your team is training a 70B-parameter LLM on 8 GPU...

You’re advising an internal platform team that mus...

You are the tech lead for training a new LLM that cannot fit on a single GPU due to parameter/activation memory, but leadership also expects near-linear throughput scaling when moving from 8 to 32 GPUs. Your cluster has 32 identical GPUs connected with high-bandwidth intra-node links and slower inter-node links. You must choose a distributed training approach that combines (as needed) data parallelism, model parallelism, pipeline parallelism (with micro-batching), and mixed precision training.

Write a recommendation memo that proposes a concrete parallelism/mixed-precision strategy and justifies it. Your memo must: (1) explain how your design resolves the single-GPU out-of-memory issue, (2) explain where and why gradient synchronization/communication happens and how it affects scaling, (3) explain how pipeline micro-batching changes device utilization compared with naive layer-splitting, and (4) explain how mixed precision improves speed/memory while still keeping training numerically stable (e.g., what stays in higher precision and why). Conclude by identifying the most likely bottleneck that will prevent perfect 32x scaling in your design and one mitigation you would try first.

Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints

You are the on-call ML engineer for a corporate LLM fine-tuning job running on 8 GPUs (each 40 GB). The model is too large to fit on a single GPU in full precision, so the team split the model across 4 GPUs in sequential stages (a pipeline). To increase throughput, they also run 2 identical pipeline replicas (so all 8 GPUs are used) and split each global mini-batch across the 2 replicas. They enabled mixed precision so most compute uses FP16, while a master copy of weights is kept in FP32 for updates.

After several hours, the run shows two problems: (1) training loss becomes unstable and occasionally spikes to NaN; (2) GPU utilization is uneven—some GPUs are frequently idle even though the input pipeline is not the bottleneck.

Write a postmortem-style response that (a) identifies the most plausible root causes that connect the chosen parallelism strategy (data parallel across replicas + model/pipeline parallel within a replica) with mixed precision behavior, and (b) proposes a concrete redesign of the training step to address BOTH numerical stability and utilization. Your answer must explicitly explain the interactions/tradeoffs among gradient aggregation across replicas, micro-batching in the pipeline, and FP16/FP32 precision choices (e.g., where precision should be used and why), and justify why your redesign would reduce NaNs while improving end-to-end throughput.

Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization

You are the on-call ML engineer for a team training a 30B-parameter LLM on a 64-GPU cluster (8 nodes × 8 GPUs). The model does not fit on a single GPU, so the team shards the model across 4 GPUs per replica (model parallelism) and uses pipeline parallelism with micro-batches to keep those 4 GPUs busy. They then replicate this 4-GPU pipeline across the remaining GPUs using data parallelism, synchronizing gradients across replicas each step. To reduce memory and increase throughput, they enable mixed precision (FP16 compute with FP32 master weights).

After a change request to “increase throughput,” the team doubles the number of data-parallel replicas (more pipelines in parallel) and also increases the number of micro-batches per step. Throughput improves, but two problems appear: (1) scaling efficiency drops sharply (adding replicas yields little additional speed), and (2) training becomes less stable (loss occasionally spikes or diverges).

Write an analysis that identifies the most likely root causes of BOTH problems and proposes a concrete mitigation plan. Your answer must explicitly connect how data parallel gradient synchronization, pipeline micro-batching, model sharding, and mixed precision interact (e.g., communication volume/frequency, pipeline bubbles/latency hiding, effective batch size and update frequency, and numerical stability during gradient aggregation). Conclude by recommending one revised configuration (at a high level) and justify the tradeoffs you are making.

Diagnosing a Scaling Regression in Hybrid Parallel LLM Training

You are the on-call ML platform lead for a company training a 30B-parameter transformer. You have access to two clusters:

- Cluster A: 8 GPUs/node, 80 GB VRAM each, fast NVLink within node, 200 Gbps inter-node network.
- Cluster B: 8 GPUs/node, 40 GB VRAM each, slower interconnect within node, 100 Gbps inter-node network.

The team’s current setup uses pure data parallelism with FP16 mixed precision (FP16 compute, FP32 master weights). On Cluster A it trains stably but is slower than expected; on Cluster B it frequently hits out-of-memory errors unless the global batch size is reduced so much that throughput collapses.

You are asked to propose ONE distributed training configuration that can run on both clusters with minimal code divergence. Your proposal must specify how you will combine (a) data parallelism, (b) model parallelism, (c) pipeline parallelism (including whether you will use micro-batches), and (d) mixed precision choices, and it must justify the key tradeoffs you are making between memory fit, communication overhead, device utilization, and numerical stability.

What configuration do you recommend, and why is it the best compromise given the two clusters’ constraints?

Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters

You are the tech lead for an internal LLM training platform. Your team is moving a 30B-parameter transformer pretraining job to a new cluster: 32 GPUs total, each with 24 GB VRAM, connected with high-bandwidth interconnect. The model does not fit on a single GPU in FP32, and the team wants to maximize tokens/sec while keeping training stable.

Two candidate configurations are proposed:

A) Pure data parallelism across all 32 GPUs, using mixed precision (FP16 compute with FP32 master weights) and gradient all-reduce every step.

B) A hybrid approach: split the model into 4 sequential pipeline stages (pipeline parallelism) with micro-batches, use model parallelism within each stage across 2 GPUs to fit the largest layers, and then use data parallelism across the remaining replicas; also use mixed precision (FP16 compute with FP32 master weights).

During a pilot run, the team observes:
- With A, the job cannot start due to out-of-memory errors even at small batch sizes.
- With B, the job starts, but tokens/sec is lower than expected and some GPUs show periodic idle gaps.

As the decision-maker, which configuration should you choose and what specific adjustment(s) would you make to address the observed issues while preserving numerical stability? In your answer, explicitly connect (1) why the memory constraint rules out or enables a strategy, (2) how data/model/pipeline parallelism interact to affect utilization and communication, and (3) how mixed precision changes both memory headroom and stability requirements.

Choosing a Distributed Training Configuration After a Hardware Refresh

You are the on-call ML engineer for a corporate LLM fine-tuning job that must finish within a weekend. The model is 30B parameters and does not fit in the memory of a single 80GB GPU in full precision. You have access to 8 identical 80GB GPUs connected with high-bandwidth interconnect. The team’s first attempt used pure data parallelism (one full model replica per GPU) and failed with out-of-memory errors. A second attempt split the model across 4 GPUs (layer-wise model parallelism) and ran, but GPU utilization was low because only one stage was busy at a time; throughput was far below target. A third attempt enabled mixed precision (FP16 compute) and ran faster, but training became numerically unstable unless the learning rate was reduced so much that the weekend deadline was missed.

As the incident owner, propose a single revised distributed training design that uses (a) an appropriate combination of data parallelism, model parallelism, and pipeline parallelism, and (b) mixed precision in a way that improves both memory feasibility and throughput while maintaining numerical stability. Your answer must explicitly justify: (1) how your design resolves the original OOM issue, (2) how it avoids the low-utilization problem seen in the sequential model split, and (3) what specific mixed-precision practice(s) you would use to reduce instability while keeping most of the speed/memory benefits.

Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run

Network partitioning, also known as layer-wise model parallelism, is a multiple-GPU training strategy where the neural network is divided sequentially across devices. Each GPU takes the input for a specific set of layers, processes it, and transfers the intermediate activations to the next GPU. While this controls the memory footprint per GPU and allows for the training of larger networks, it introduces significant bottlenecks. The interfaces between layers require tight synchronization and massive data transfers of activations and gradients, which can easily overwhelm GPU bus bandwidth. Furthermore, ensuring that sequential computational workloads are evenly matched between layers is highly difficult, making linear scaling challenging to achieve.

Network Partitioning

Layerwise partitioning, commonly known as tensor parallelism, is a multiple-GPU training strategy that splits the computational work within individual network layers across different devices. For example, instead of computing all channels of a convolutional layer on a single GPU, the workload can be distributed so that multiple GPUs each compute a fraction of the channels. This approach scales effectively in terms of computation and enables the processing of larger networks by pooling the memory of multiple GPUs. However, because each layer's computation depends on the aggregated results from all other participating GPUs, this strategy requires a massive number of synchronization barriers and incurs exorbitant bandwidth costs for data transfer.

Layerwise Partitioning

Model parallelism is a technique used when a model is too large to be loaded and executed on a single device, making data parallelism impractical. Unlike data parallelism, which requires each worker to have a full copy of the model for both forward and backward passes, model parallelism involves partitioning the model itself into smaller components. These components are then distributed and run on different devices.

Google

In the context of training Large Language Models, parallelism can be implemented through several distinct approaches. The primary forms include data parallelism, model parallelism, tensor parallelism, and pipeline parallelism.

Types of Parallelism in LLM Training

Reference of Foundations of Large Language Models Course

Data parallelism is a widely used and highly convenient strategy for distributing deep learning training across multiple GPUs. In this approach, every GPU maintains a complete replica of the model and performs the identical sequence of operations, but each processes a different subset of the training minibatch. After each minibatch, the independently computed gradients are aggregated across all GPUs to synchronize and update the model parameters. To maximize efficiency, it is highly desirable to overlap computation and communication by exchanging gradients for some parameters while others are still being computed. While data parallelism enables larger effective minibatch sizes and increases overall training throughput, it is ultimately constrained by the memory of a single GPU and does not facilitate the training of larger models.

Data Parallelism

Model Parallelism

A research team is developing a novel language model with several trillion parameters. During the initial training setup, they discover that the model is too large to fit into the memory of a single available accelerator (e.g., a GPU). Which parallelism strategy is specifically designed to address this fundamental constraint?

Match each parallelism strategy with the description that best defines its core mechanism for distributing the training workload.

A machine learning team is training a large model partitioned across four accelerators, where each accelerator holds a different sequential segment of the model. They notice that their monitoring tools show a 'bubble' of inactivity that propagates through the accelerators; only one device is active at any given time during a forward or backward pass, leading to poor overall hardware utilization. What specific type of parallelism is designed to solve this exact problem, and how does it achieve better hardware utilization?

Learn Before

Related

Learn After