
Diagnosing a Scaling Regression in Hybrid Parallel LLM Training

You are the on-call ML engineer for a team training a 30B-parameter LLM on a 64-GPU cluster (8 nodes × 8 GPUs). The model does not fit on a single GPU, so the team shards the model across 4 GPUs per replica (model parallelism) and uses pipeline parallelism with micro-batches to keep those 4 GPUs busy. They then replicate this 4-GPU pipeline across the remaining GPUs using data parallelism, synchronizing gradients across replicas each step. To reduce memory and increase throughput, they enable mixed precision (FP16 compute with FP32 master weights).
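To make the baseline layout concrete, here is a minimal sketch in plain Python (no training framework assumed). The cluster size and 4-way model sharding come from the scenario; the micro-batch count, per-replica micro-batch size, and field names are illustrative assumptions, not the team's actual settings.

```python
from dataclasses import dataclass

@dataclass
class BaselineParallelLayout:
    # Facts taken from the scenario above.
    total_gpus: int = 64              # 8 nodes x 8 GPUs
    model_parallel_size: int = 4      # model sharded across 4 GPUs per replica
    # Illustrative assumptions (not specified in the scenario):
    num_micro_batches: int = 8        # micro-batches fed through each pipeline per step
    micro_batch_size: int = 2         # samples per micro-batch per replica
    fp16_compute: bool = True         # FP16 forward/backward
    fp32_master_weights: bool = True  # FP32 copy held by the optimizer

    @property
    def data_parallel_replicas(self) -> int:
        # One replica per 4-GPU pipeline; gradients are all-reduced across replicas.
        return self.total_gpus // self.model_parallel_size

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer update.
        return self.micro_batch_size * self.num_micro_batches * self.data_parallel_replicas

layout = BaselineParallelLayout()
print(layout.data_parallel_replicas)  # 16 pipelines running in data parallel
print(layout.effective_batch_size)    # 2 * 8 * 16 = 256 samples per update
```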

After a change request to “increase throughput,” the team doubles the number of data-parallel replicas (more pipelines in parallel) and also increases the number of micro-batches per step. Throughput improves, but two problems appear: (1) scaling efficiency drops sharply (adding replicas yields little additional speed), and (2) training becomes less stable (loss occasionally spikes or diverges).
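A quick back-of-the-envelope calculation shows why this change inflates the effective batch size while leaving the per-GPU gradient-synchronization payload unchanged. All specific numbers below are assumptions for illustration (the scenario does not give batch sizes, and doubling the replica count from a 64-GPU baseline would itself require extra GPUs or a different layout).

```python
# Back-of-the-envelope; every specific number here is an illustrative assumption.
micro_batch_size = 2                                 # samples per micro-batch per replica (assumed)
micro_batches_before, micro_batches_after = 8, 16    # "more micro-batches" assumed to mean 2x
replicas_before, replicas_after = 16, 32             # doubled replicas (assumes extra GPUs or a new layout)

batch_before = micro_batch_size * micro_batches_before * replicas_before  # 256 samples per update
batch_after = micro_batch_size * micro_batches_after * replicas_after     # 1024 samples per update

# Gradient all-reduce payload per GPU per step (not reduced by adding replicas):
params_per_gpu = 30e9 / 4              # 30B parameters sharded 4 ways
grad_bytes_fp16 = params_per_gpu * 2   # ~15 GB of FP16 gradients to synchronize each step
print(batch_after / batch_before)      # 4.0x larger effective batch per optimizer update
print(grad_bytes_fp16 / 1e9)           # ~15 GB per GPU, now crossing more nodes
```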

Write an analysis that identifies the most likely root causes of BOTH problems and proposes a concrete mitigation plan. Your answer must explicitly connect how data-parallel gradient synchronization, pipeline micro-batching, model sharding, and mixed precision interact (e.g., communication volume/frequency, pipeline bubbles/latency hiding, effective batch size and update frequency, and numerical stability during gradient aggregation). Conclude by recommending one revised configuration (at a high level) and justifying the tradeoffs you are making.
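For reference when discussing pipeline bubbles: under a GPipe-style schedule, a commonly used estimate for the idle fraction of a step with $p$ pipeline stages and $m$ micro-batches (symbols here are generic, not the team's notation) is

\[
\text{bubble fraction} \approx \frac{p - 1}{m + p - 1}.
\]

With $p = 4$ stages and, say, $m = 8$ micro-batches, roughly $3/11 \approx 27\%$ of the step is idle; increasing $m$ shrinks the bubble, but with synchronous data parallelism the gradient all-reduce can only complete after the last micro-batch of the step.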
