Case Study

Choosing a Distributed Training Configuration After a Hardware Refresh

You are the tech lead for an internal LLM training platform. Your team is moving a 30B-parameter transformer pretraining job to a new cluster: 32 GPUs in total, each with 24 GB of VRAM, connected by a high-bandwidth interconnect. The model does not fit on a single GPU in FP32, and the team wants to maximize tokens/sec while keeping training stable.
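As a quick sanity check on the memory constraint: in FP32 the parameter tensors alone come to roughly 120 GB, several times the 24 GB of VRAM on any one GPU, before gradients, optimizer state, or activations are counted. A minimal back-of-envelope sketch in Python (illustrative arithmetic only):

```python
# Back-of-envelope: FP32 parameter storage for a 30B-parameter model
# versus a single 24 GB GPU. Gradients, optimizer state, and activations
# would only add to this total.
params = 30e9        # 30B parameters
fp32_bytes = 4       # bytes per FP32 value

weights_gb = params * fp32_bytes / 1e9
print(f"FP32 weights alone: ~{weights_gb:.0f} GB vs. 24 GB of VRAM per GPU")
```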

Two candidate configurations are proposed:

A) Pure data parallelism across all 32 GPUs, using mixed precision (FP16 compute with FP32 master weights) and gradient all-reduce every step.

B) A hybrid approach: split the model into 4 sequential pipeline stages (pipeline parallelism) fed with micro-batches, use model parallelism across 2 GPUs within each stage to fit the largest layers, and apply data parallelism across the resulting 4 replicas (8 GPUs each); also use mixed precision (FP16 compute with FP32 master weights). A rough per-GPU memory comparison of the two options is sketched below.
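The sketch below compares the persistent model state each GPU would have to hold under the two options. Treat it as a rough estimate only: it assumes Adam with FP32 master weights and moments (about 16 bytes of state per parameter under mixed precision), ignores activations and communication buffers, and says nothing about where option B places its optimizer state.

```python
# Rough per-GPU footprint of persistent model state (ignores activations).
# Mixed precision with Adam: FP16 weights (2 B) + FP16 grads (2 B)
# + FP32 master weights (4 B) + FP32 Adam moments (8 B) ~= 16 bytes/param.
PARAMS = 30e9
STATE_BYTES_PER_PARAM = 2 + 2 + 4 + 8

# Option A: pure data parallelism -- every GPU holds a full replica of the
# model state, so shrinking the batch size cannot help.
full_replica_gb = PARAMS * STATE_BYTES_PER_PARAM / 1e9
print(f"A: ~{full_replica_gb:.0f} GB of model state per GPU vs. 24 GB VRAM")

# Option B: 4 pipeline stages x 2-way model parallelism splits the
# parameters 8 ways before data parallelism replicates the result.
shard_params = PARAMS / (4 * 2)
print(f"B: ~{shard_params / 1e9:.2f}B parameters per GPU "
      f"(~{shard_params * 2 / 1e9:.1f} GB as FP16 weights alone)")
```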

During a pilot run, the team observes:

  • With A, the job cannot start due to out-of-memory errors even at small batch sizes.
  • With B, the job starts, but tokens/sec is lower than expected and some GPUs show periodic idle gaps (a first-order estimate of this idle fraction is sketched after this list).
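The periodic idle gaps under option B are consistent with the pipeline "bubble": with a GPipe-style schedule, a common first-order estimate of the idle fraction is (p - 1) / (m + p - 1) for p pipeline stages and m micro-batches per step, so increasing the number of micro-batches is the usual lever. A small illustrative sketch (the real figure depends on the scheduler and on how balanced the stages are):

```python
# First-order pipeline-bubble estimate for a GPipe-style schedule:
# with p stages and m micro-batches per step, stages are idle for
# roughly (p - 1) / (m + p - 1) of the step.
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

p = 4  # pipeline stages in option B
for m in (1, 4, 8, 16, 32):
    print(f"micro-batches = {m:>2}: idle fraction ~ {bubble_fraction(p, m):.0%}")
# m = 4 -> ~43% idle; m = 16 -> ~16%; m = 32 -> ~9%
```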

As the decision-maker, which configuration should you choose and what specific adjustment(s) would you make to address the observed issues while preserving numerical stability? In your answer, explicitly connect (1) why the memory constraint rules out or enables a strategy, (2) how data/model/pipeline parallelism interact to affect utilization and communication, and (3) how mixed precision changes both memory headroom and stability requirements.

