Case Study

Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run

You are the on-call ML engineer for a corporate LLM fine-tuning job that must finish within a weekend. The model has 30B parameters and does not fit in the memory of a single 80GB GPU in full precision. You have access to 8 identical 80GB GPUs connected by a high-bandwidth interconnect. The team's first attempt used pure data parallelism (one full model replica per GPU) and failed with out-of-memory errors. A second attempt split the model layer-wise across 4 GPUs (model parallelism) and ran, but GPU utilization was low because only one stage was busy at a time, so throughput was far below target. A third attempt enabled mixed precision (FP16 compute) and ran faster, but training became numerically unstable unless the learning rate was reduced so much that the weekend deadline would be missed.
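For context on the OOM failure, a rough back-of-the-envelope estimate shows why one full replica per GPU could not fit. The figures below are illustrative assumptions (FP32 weights, FP32 gradients, and Adam's two FP32 moment buffers), not numbers from the incident report:

```python
# Approximate per-replica memory for a 30B-parameter model trained in full
# precision with Adam (assumed optimizer; activations excluded for simplicity).
params = 30e9
fp32_bytes = 4

weights   = params * fp32_bytes   # ~120 GB
gradients = params * fp32_bytes   # ~120 GB
adam_m    = params * fp32_bytes   # ~120 GB, first-moment buffer
adam_v    = params * fp32_bytes   # ~120 GB, second-moment buffer

total_gb = (weights + gradients + adam_m + adam_v) / 1e9
print(f"~{total_gb:.0f} GB per replica before activations")  # ~480 GB, far above 80 GB
```

Even the weights alone (~120 GB) exceed a single 80GB card, which is why pure data parallelism failed immediately.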

As the incident owner, propose a single revised distributed training design that uses (a) an appropriate combination of data parallelism, model parallelism, and pipeline parallelism, and (b) mixed precision in a way that improves both memory feasibility and throughput while maintaining numerical stability. Your answer must explicitly explain: (1) how your design resolves the original OOM issue, (2) how it avoids the low-utilization problem seen in the sequential model split, and (3) what specific mixed-precision practice(s) you would use to reduce instability while keeping most of the speed and memory benefits.
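As a reference point for the mixed-precision part of the answer, here is a minimal PyTorch sketch of the standard stability practices: reduced-precision compute, FP32 master weights held by the optimizer, dynamic loss scaling, and gradient clipping. The tiny model, random data, and hyperparameters are placeholders for illustration, not details from the case study:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
# Stand-in model and data; only the precision handling below is the point.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # FP32 master weights live here
scaler = GradScaler()  # dynamic loss scaling guards FP16 gradients against underflow

for step in range(10):
    x = torch.randn(8, 1024, device=device)
    target = torch.randn(8, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)

    with autocast(dtype=torch.float16):          # forward pass in FP16
        loss = torch.nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()                # scale the loss before backward
    scaler.unscale_(optimizer)                   # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                       # the step is skipped if inf/NaN gradients appear
    scaler.update()                              # the loss scale adapts for the next iteration
```

On A100/H100-class 80GB GPUs, switching autocast to torch.bfloat16 is a common alternative: BF16 keeps FP32's exponent range, which usually removes the need for loss scaling while retaining most of the speed and memory benefits.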
