Google

Pipeline parallelism is a strategy designed to overcome the inefficiency of basic model parallelism, where hardware is underutilized because only one device is active at any given moment. This technique introduces computational overlap by dividing a data batch into smaller units called micro-batches. These micro-batches are fed into a pipeline of workers, allowing a worker to begin processing the next micro-batch as soon as it has passed the current one to the subsequent worker. This creates a continuous flow of computation, ensuring that different devices are working simultaneously on different stages of the process.

Pipeline Parallelism

The core mechanism of pipeline parallelism involves partitioning a data batch into several smaller 'micro-batches'. These micro-batches are then fed sequentially into the pipeline of workers. As soon as a worker completes its computation for one micro-batch and forwards it to the subsequent worker, it immediately begins processing the next available micro-batch. This continuous flow ensures that different stages of the computation are active simultaneously, maximizing device utilization.

Micro-batching in Pipeline Parallelism

An illustration of pipeline parallelism typically demonstrates how computation is staggered across multiple workers (e.g., $$L$$ workers) to process multiple micro-batches. Let $$\mathrm{B}_{l,k}$$ denote the processing of the $$k$$-th micro-batch by the $$l$$-th worker. A pipeline is created where a subsequent worker begins processing a micro-batch immediately after the preceding worker has completed its step and passed it along. This staggered, overlapping execution allows multiple workers to be active concurrently on different micro-batches, which significantly maximizes hardware utilization and minimizes the idle time that occurs in simpler sequential approaches.

Illustration of Pipeline Parallelism with Micro-batches

A large neural network model is partitioned across four sequential processing stages, with each stage running on a separate hardware device. During training, a full batch of data is processed entirely by the first device, and its output is then passed to the second device. The second device processes this output and passes its result to the third, and so on. While one device is actively computing, the other three devices are idle, waiting for their turn. What is the primary inefficiency this specific computational strategy introduces?

A large computational model is partitioned across two hardware devices (Device 1 and Device 2) in a sequential pipeline. To improve efficiency, a data batch is divided into two smaller micro-batches. Arrange the following events in the correct chronological order to accurately represent the flow of computation that maximizes hardware utilization.

Based on the scenario provided, explain the fundamental reason for the low hardware utilization. Then, describe a modification to the team's training process that would allow multiple devices to compute simultaneously on different parts of the data batch, thereby increasing overall efficiency.

Optimizing Training Efficiency for a Large Model

Your team must train a 30B-parameter LLM on a sing...

You are on-call for an internal LLM training platf...

Your team is training a 70B-parameter LLM on 8 GPU...

You’re advising an internal platform team that mus...

You are the tech lead for training a new LLM that cannot fit on a single GPU due to parameter/activation memory, but leadership also expects near-linear throughput scaling when moving from 8 to 32 GPUs. Your cluster has 32 identical GPUs connected with high-bandwidth intra-node links and slower inter-node links. You must choose a distributed training approach that combines (as needed) data parallelism, model parallelism, pipeline parallelism (with micro-batching), and mixed precision training.

Write a recommendation memo that proposes a concrete parallelism/mixed-precision strategy and justifies it. Your memo must: (1) explain how your design resolves the single-GPU out-of-memory issue, (2) explain where and why gradient synchronization/communication happens and how it affects scaling, (3) explain how pipeline micro-batching changes device utilization compared with naive layer-splitting, and (4) explain how mixed precision improves speed/memory while still keeping training numerically stable (e.g., what stays in higher precision and why). Conclude by identifying the most likely bottleneck that will prevent perfect 32x scaling in your design and one mitigation you would try first.

Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints

You are the on-call ML engineer for a corporate LLM fine-tuning job running on 8 GPUs (each 40 GB). The model is too large to fit on a single GPU in full precision, so the team split the model across 4 GPUs in sequential stages (a pipeline). To increase throughput, they also run 2 identical pipeline replicas (so all 8 GPUs are used) and split each global mini-batch across the 2 replicas. They enabled mixed precision so most compute uses FP16, while a master copy of weights is kept in FP32 for updates.

After several hours, the run shows two problems: (1) training loss becomes unstable and occasionally spikes to NaN; (2) GPU utilization is uneven—some GPUs are frequently idle even though the input pipeline is not the bottleneck.

Write a postmortem-style response that (a) identifies the most plausible root causes that connect the chosen parallelism strategy (data parallel across replicas + model/pipeline parallel within a replica) with mixed precision behavior, and (b) proposes a concrete redesign of the training step to address BOTH numerical stability and utilization. Your answer must explicitly explain the interactions/tradeoffs among gradient aggregation across replicas, micro-batching in the pipeline, and FP16/FP32 precision choices (e.g., where precision should be used and why), and justify why your redesign would reduce NaNs while improving end-to-end throughput.

Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization

You are the on-call ML engineer for a team training a 30B-parameter LLM on a 64-GPU cluster (8 nodes × 8 GPUs). The model does not fit on a single GPU, so the team shards the model across 4 GPUs per replica (model parallelism) and uses pipeline parallelism with micro-batches to keep those 4 GPUs busy. They then replicate this 4-GPU pipeline across the remaining GPUs using data parallelism, synchronizing gradients across replicas each step. To reduce memory and increase throughput, they enable mixed precision (FP16 compute with FP32 master weights).

After a change request to “increase throughput,” the team doubles the number of data-parallel replicas (more pipelines in parallel) and also increases the number of micro-batches per step. Throughput improves, but two problems appear: (1) scaling efficiency drops sharply (adding replicas yields little additional speed), and (2) training becomes less stable (loss occasionally spikes or diverges).

Write an analysis that identifies the most likely root causes of BOTH problems and proposes a concrete mitigation plan. Your answer must explicitly connect how data parallel gradient synchronization, pipeline micro-batching, model sharding, and mixed precision interact (e.g., communication volume/frequency, pipeline bubbles/latency hiding, effective batch size and update frequency, and numerical stability during gradient aggregation). Conclude by recommending one revised configuration (at a high level) and justify the tradeoffs you are making.

Diagnosing a Scaling Regression in Hybrid Parallel LLM Training

You are the on-call ML platform lead for a company training a 30B-parameter transformer. You have access to two clusters:

- Cluster A: 8 GPUs/node, 80 GB VRAM each, fast NVLink within node, 200 Gbps inter-node network.
- Cluster B: 8 GPUs/node, 40 GB VRAM each, slower interconnect within node, 100 Gbps inter-node network.

The team’s current setup uses pure data parallelism with FP16 mixed precision (FP16 compute, FP32 master weights). On Cluster A it trains stably but is slower than expected; on Cluster B it frequently hits out-of-memory errors unless the global batch size is reduced so much that throughput collapses.

You are asked to propose ONE distributed training configuration that can run on both clusters with minimal code divergence. Your proposal must specify how you will combine (a) data parallelism, (b) model parallelism, (c) pipeline parallelism (including whether you will use micro-batches), and (d) mixed precision choices, and it must justify the key tradeoffs you are making between memory fit, communication overhead, device utilization, and numerical stability.

What configuration do you recommend, and why is it the best compromise given the two clusters’ constraints?

Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters

You are the tech lead for an internal LLM training platform. Your team is moving a 30B-parameter transformer pretraining job to a new cluster: 32 GPUs total, each with 24 GB VRAM, connected with high-bandwidth interconnect. The model does not fit on a single GPU in FP32, and the team wants to maximize tokens/sec while keeping training stable.

Two candidate configurations are proposed:

A) Pure data parallelism across all 32 GPUs, using mixed precision (FP16 compute with FP32 master weights) and gradient all-reduce every step.

B) A hybrid approach: split the model into 4 sequential pipeline stages (pipeline parallelism) with micro-batches, use model parallelism within each stage across 2 GPUs to fit the largest layers, and then use data parallelism across the remaining replicas; also use mixed precision (FP16 compute with FP32 master weights).

During a pilot run, the team observes:
- With A, the job cannot start due to out-of-memory errors even at small batch sizes.
- With B, the job starts, but tokens/sec is lower than expected and some GPUs show periodic idle gaps.

As the decision-maker, which configuration should you choose and what specific adjustment(s) would you make to address the observed issues while preserving numerical stability? In your answer, explicitly connect (1) why the memory constraint rules out or enables a strategy, (2) how data/model/pipeline parallelism interact to affect utilization and communication, and (3) how mixed precision changes both memory headroom and stability requirements.

Choosing a Distributed Training Configuration After a Hardware Refresh

You are the on-call ML engineer for a corporate LLM fine-tuning job that must finish within a weekend. The model is 30B parameters and does not fit in the memory of a single 80GB GPU in full precision. You have access to 8 identical 80GB GPUs connected with high-bandwidth interconnect. The team’s first attempt used pure data parallelism (one full model replica per GPU) and failed with out-of-memory errors. A second attempt split the model across 4 GPUs (layer-wise model parallelism) and ran, but GPU utilization was low because only one stage was busy at a time; throughput was far below target. A third attempt enabled mixed precision (FP16 compute) and ran faster, but training became numerically unstable unless the learning rate was reduced so much that the weekend deadline was missed.

As the incident owner, propose a single revised distributed training design that uses (a) an appropriate combination of data parallelism, model parallelism, and pipeline parallelism, and (b) mixed precision in a way that improves both memory feasibility and throughput while maintaining numerical stability. Your answer must explicitly justify: (1) how your design resolves the original OOM issue, (2) how it avoids the low-utilization problem seen in the sequential model split, and (3) what specific mixed-precision practice(s) you would use to reduce instability while keeping most of the speed/memory benefits.

Learn Before

Related