Case Study

Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters

You are the on-call ML platform lead for a company training a 30B-parameter transformer. You have access to two clusters:

  • Cluster A: 8 GPUs per node, 80 GB VRAM each, fast NVLink within each node, 200 Gbps inter-node network.
  • Cluster B: 8 GPUs per node, 40 GB VRAM each, slower intra-node interconnect, 100 Gbps inter-node network.

The team’s current setup uses pure data parallelism with FP16 mixed precision (FP16 compute, FP32 master weights). On Cluster A it trains stably but is slower than expected; on Cluster B it frequently hits out-of-memory errors unless the global batch size is reduced so much that throughput collapses.
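
For reference, the setup described above corresponds roughly to the following training loop: a minimal sketch assuming PyTorch DistributedDataParallel with torch.cuda.amp loss scaling. The build_transformer factory, the dataloader, and the batch format are hypothetical placeholders, not the team's actual code.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")                     # one process per GPU
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

    model = build_transformer().to(device)              # hypothetical 30B model factory
    model = DDP(model, device_ids=[device.index])       # full replica on every GPU
    optimizer = torch.optim.AdamW(model.parameters())   # params kept in FP32 ("master weights")
    scaler = torch.cuda.amp.GradScaler()                # dynamic loss scaling for FP16

    for batch in dataloader:                            # hypothetical per-rank dataloader
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast("cuda", dtype=torch.float16):   # FP16 compute
            loss = model(**batch).loss                  # placeholder batch/output format
        scaler.scale(loss).backward()                   # gradients all-reduced across ranks
        scaler.step(optimizer)                          # unscales; skips the step on inf/NaN
        scaler.update()

Because every rank holds a full replica of the model, the optimizer states, and the FP32 master weights, the only memory knob this loop exposes is the per-rank batch size, which is why Cluster B ends up in the throughput collapse described above.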

You are asked to propose ONE distributed training configuration that can run on both clusters with minimal code divergence. Your proposal must specify how you will combine (a) data parallelism, (b) tensor (model) parallelism, (c) pipeline parallelism (including whether you will use micro-batches), and (d) mixed-precision choices, and it must justify the key tradeoffs you are making among memory fit, communication overhead, device utilization, and numerical stability.
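
To make the axes in (a)-(d) concrete, the sketch below shows the bookkeeping any such proposal has to pin down. The class name, node counts, parallel degrees, and batch numbers are hypothetical placeholders used only to illustrate the standard 3D-parallel decomposition, not the recommended answer.

    # Invariants illustrated here:
    #   world_size   = data_parallel * tensor_parallel * pipeline_parallel
    #   global_batch = data_parallel * micro_batch_size * num_micro_batches
    from dataclasses import dataclass

    @dataclass
    class ParallelConfig:
        gpus_per_node: int
        num_nodes: int
        tensor_parallel: int      # bandwidth-hungry; usually kept within a node's fast links
        pipeline_parallel: int    # inter-node friendly; only point-to-point activation traffic
        micro_batch_size: int     # per-GPU micro-batch pushed through the pipeline
        num_micro_batches: int    # more micro-batches -> smaller pipeline bubble
        precision: str            # e.g. "fp16" with FP32 master weights, or "bf16"

        @property
        def world_size(self) -> int:
            return self.gpus_per_node * self.num_nodes

        @property
        def data_parallel(self) -> int:
            return self.world_size // (self.tensor_parallel * self.pipeline_parallel)

        @property
        def global_batch(self) -> int:
            return self.data_parallel * self.micro_batch_size * self.num_micro_batches

    # Hypothetical instantiations (node counts and degrees are placeholders):
    cluster_a = ParallelConfig(gpus_per_node=8, num_nodes=4,
                               tensor_parallel=4, pipeline_parallel=2,
                               micro_batch_size=2, num_micro_batches=32, precision="fp16")
    cluster_b = ParallelConfig(gpus_per_node=8, num_nodes=4,
                               tensor_parallel=8, pipeline_parallel=2,
                               micro_batch_size=1, num_micro_batches=64, precision="fp16")

A useful property of expressing the configuration this way is that the training code stays identical across clusters; only the per-cluster degrees and micro-batch numbers change, which is what "minimal code divergence" asks for.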

What configuration do you recommend, and why is it the best compromise given the two clusters’ constraints?

Tags: Ch.2 Generative Models - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences
