Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization
You are the on-call ML engineer for a corporate LLM fine-tuning job running on 8 GPUs (40 GB each). The model is too large to fit on a single GPU in full precision, so the team has split it across 4 GPUs into sequential stages (a pipeline). To increase throughput, they also run 2 identical pipeline replicas (so all 8 GPUs are used) and split each global mini-batch across the 2 replicas. Mixed precision is enabled, so most compute runs in FP16 while a master copy of the weights is kept in FP32 for optimizer updates.
After several hours, the run shows two problems: (1) training loss becomes unstable and occasionally spikes to NaN; (2) GPU utilization is uneven—some GPUs are frequently idle even though the input pipeline is not the bottleneck.
Write a postmortem-style response that (a) identifies the most plausible root causes connecting the chosen parallelism strategy (data parallelism across replicas plus model/pipeline parallelism within a replica) with the mixed-precision behavior, and (b) proposes a concrete redesign of the training step that addresses BOTH numerical stability and utilization. Your answer must explicitly explain the interactions and tradeoffs among gradient aggregation across replicas, micro-batching in the pipeline, and the FP16/FP32 precision choices (e.g., where FP16 versus FP32 should be used and why), and justify why your redesign would reduce NaNs while improving end-to-end throughput.
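For concreteness, here is a minimal single-process sketch (plain NumPy, not a definitive answer and not the team's actual code) of one redesigned step along the lines the prompt asks about: FP16 compute with loss scaling, per-micro-batch gradient accumulation in FP32, a single cross-replica gradient average per step, and a NaN/Inf check before the FP32 master-weight update. The sizes, the stand-in gradient function, the loss scale, and the learning rate are all placeholders.

```python
import numpy as np

N_REPLICAS, N_MICROBATCHES, N_PARAMS = 2, 4, 8
loss_scale = 2.0 ** 12            # would be adjusted dynamically in a real run

rng = np.random.default_rng(0)
master_w = rng.normal(size=N_PARAMS).astype(np.float32)   # FP32 master weights
fp16_w = master_w.astype(np.float16)                       # working copy used in compute

def fake_fp16_backward(w16: np.ndarray) -> np.ndarray:
    """Stand-in for the FP16 forward+backward of one micro-batch."""
    true_grad = rng.normal(scale=1e-4, size=w16.shape)      # small "true" gradients
    return (true_grad * loss_scale).astype(np.float16)       # scaled so FP16 does not underflow

# 1) Each replica accumulates its micro-batch gradients in FP32.
replica_grads = []
for _replica in range(N_REPLICAS):
    acc = np.zeros(N_PARAMS, dtype=np.float32)
    for _micro in range(N_MICROBATCHES):
        acc += fake_fp16_backward(fp16_w).astype(np.float32)
    replica_grads.append(acc / N_MICROBATCHES)

# 2) Average gradients across replicas -- what a per-step all-reduce computes.
global_grad = np.mean(replica_grads, axis=0)

# 3) Unscale in FP32; skip the update (and reduce the scale) if anything overflowed.
global_grad /= loss_scale
if np.isfinite(global_grad).all():
    master_w -= 1e-3 * global_grad            # FP32 optimizer update
    fp16_w = master_w.astype(np.float16)      # refresh the FP16 working copy
else:
    loss_scale /= 2.0                         # back off and retry on the next step
```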
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Parallelism in Distributed LLM Training
LLM Training Infrastructure Strategy
A research team is developing a new language model with billions of parameters. They observe that their training process consistently fails with 'out-of-memory' errors on a single, top-of-the-line GPU. Which statement best analyzes the core computational bottleneck that requires the adoption of a distributed training strategy?
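As a back-of-envelope check on that bottleneck, a common accounting for mixed-precision Adam training is roughly 16 bytes of state per parameter before any activation memory is counted. The sketch below assumes a 7B-parameter model as an example; that figure is not from the question.

```python
# Rough memory estimate: optimizer and weight state alone exceeds a 40-80 GB GPU.
params = 7e9          # assumed example size
bytes_per_param = (
    2      # FP16 weights used in the forward/backward pass
    + 2    # FP16 gradients
    + 4    # FP32 master copy of the weights
    + 4    # Adam first moment (FP32)
    + 4    # Adam second moment (FP32)
)
print(params * bytes_per_param / 1e9, "GB before activations")   # 112.0 GB
```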
Computational Bottlenecks in Single-Machine LLM Training
Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints
Diagnosing a Scaling Regression in Hybrid Parallel LLM Training
Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run
Choosing a Distributed Training Configuration After a Hardware Refresh
Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters
Advancements in Deep Learning Hardware and Software
Gradient Descent Update Rule
Set of Distributed Data Batches in Data Parallelism
Ideal Speed-up in Data Parallelism
A team is training a neural network using a technique where a large batch of data is split equally among 8 machines. Each machine has a full, identical copy of the network model. During a training step, each machine processes its portion of the data and calculates a set of proposed parameter updates. Given this setup, what is the most critical subsequent action to ensure the entire system learns effectively from the full batch of data?
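A minimal sketch of that critical step, under the assumption of simulated workers in one process: the per-worker gradients must be averaged (this is what an all-reduce computes) so that every replica applies the identical update and the model copies stay in sync. The worker count, parameter count, and learning rate below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = np.zeros(4, dtype=np.float32)                       # identical on every machine
per_worker_grads = [rng.normal(size=4).astype(np.float32) for _ in range(8)]

avg_grad = np.mean(per_worker_grads, axis=0)                  # the all-reduce result
weights -= 0.1 * avg_grad                                     # same update applied everywhere
```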
Distributed Gradient Calculation
A single training step is performed using a technique where a mini-batch of data is processed in parallel across multiple machines. Each machine holds a complete copy of the model. Arrange the following events in the correct chronological order for one such training step.
A machine learning team is training a large neural network on a massive dataset. To accelerate the process, they employ a strategy where the training data is split across 16 GPUs. Each GPU holds a complete copy of the model and processes its own subset of the data. After each forward and backward pass, the results from all GPUs are combined before updating the model's parameters. The team observes that while using 8 GPUs provided a nearly 8x speed-up compared to a single GPU, scaling to 16 GPUs only resulted in a 10x total speed-up. Based on the principles of the training strategy described, what is the most likely bottleneck causing this diminishing return in performance when scaling from 8 to 16 GPUs?
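A toy cost model makes the diminishing return concrete: per-step compute time shrinks roughly as 1/N, while the gradient-synchronization (communication) time grows, or at best stays flat, as more GPUs participate. The constants below are invented purely to show the trend and do not reproduce the exact 8x and 10x figures.

```python
def speedup(n_gpus: int, compute_time: float = 1000.0, comm_per_gpu: float = 2.0) -> float:
    # Step time = compute shared across GPUs + communication that scales with GPU count.
    step_time = compute_time / n_gpus + comm_per_gpu * n_gpus
    return compute_time / step_time

for n in (1, 8, 16, 32):
    print(n, "GPUs ->", round(speedup(n), 1), "x")   # gains flatten as communication dominates
```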
Evaluating a Training Strategy
Layer-wise Model Parallelism
Combining Model Parallelism with Other Mechanisms
Tensor Parallelism
Pipeline Parallelism
A research team is training a neural network that is too large to fit into the memory of a single processing unit. To overcome this limitation, they decide to split the network's layers, placing the first set of layers on the first unit, the next set on the second unit, and so on, with the data flowing through them in sequence. Which statement best analyzes how this strategy addresses the memory constraint?
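A toy sketch of that layer-wise split, assuming 8 layers over 4 sequential stages with each stage standing in for one device: a device only has to hold its own layers' parameters, and what moves between devices is the activation tensor. Shapes and values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
layers = [rng.normal(size=(16, 16)).astype(np.float32) for _ in range(8)]
stages = [layers[0:2], layers[2:4], layers[4:6], layers[6:8]]   # 2 layers per "device"

activation = rng.normal(size=16).astype(np.float32)
for stage in stages:                       # data flows through the stages in sequence
    for w in stage:
        activation = np.tanh(activation @ w)   # this stage's share of the forward pass
```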
Choosing a Parallelism Strategy for a Large Model
Rationale for Model Partitioning
Micro-batching in Pipeline Parallelism
Illustration of Pipeline Parallelism with Micro-batches
A large neural network model is partitioned across four sequential processing stages, with each stage running on a separate hardware device. During training, a full batch of data is processed entirely by the first device, and its output is then passed to the second device. The second device processes this output and passes its result to the third, and so on. While one device is actively computing, the other three devices are idle, waiting for their turn. What is the primary inefficiency this specific computational strategy introduces?
A large computational model is partitioned across two hardware devices (Device 1 and Device 2) in a sequential pipeline. To improve efficiency, a data batch is divided into two smaller micro-batches. Arrange the following events in the correct chronological order to accurately represent the flow of computation that maximizes hardware utilization.
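A back-of-envelope formula shows why the micro-batch schedule in these two questions matters: with S equal-cost stages and M micro-batches, each device sits idle for the "bubble" fraction (S - 1)/(M + S - 1) of the step. The sketch below assumes equal per-micro-batch cost and ignores communication time.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Idle fraction per device in a simple fill-and-drain pipeline schedule.
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(2, 1))   # whole batch, no micro-batching: 50% idle
print(bubble_fraction(2, 2))   # the 2-micro-batch schedule above: ~33% idle
print(bubble_fraction(4, 8))   # 4 stages, 8 micro-batches: ~27% idle
```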
Optimizing Training Efficiency for a Large Model
Gradient Accumulation in Mixed Precision Training
Low-Precision Arithmetic Challenges in Distributed Training
Optimizing Language Model Training Efficiency
A machine learning team is training a large model using a strategy that employs both 16-bit and 32-bit floating-point numbers. They observe that each training step is significantly faster and uses less memory, but the model's final performance is poor due to accumulating numerical errors that destabilize the training process. Which of the following is the most probable cause of this issue?
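A small demonstration of the most common culprit in that scenario: gradients that are representable in FP32 underflow to zero in FP16 unless the loss (and hence the gradients) is scaled up before the backward pass and unscaled in FP32 afterwards. The gradient value and scale factor below are illustrative only.

```python
import numpy as np

tiny_grad = np.float32(1e-8)                 # fine in FP32
print(np.float16(tiny_grad))                 # 0.0 -- underflows in FP16, information lost
scale = np.float32(65536.0)
scaled = np.float16(tiny_grad * scale)       # ~6.55e-4, comfortably inside FP16's range
print(np.float32(scaled) / scale)            # ~1e-8 recovered after unscaling in FP32
```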
Rationale for Mixed Precision in Model Training