1Cademy - Diagnosing Model Divergence in Distributed Training

Learn Before

Gradient Accumulation in Mixed Precision Training

Case Study

Diagnosing Model Divergence in Distributed Training

Based on the principles of numerical computation in this training setup, what is the most probable cause for the model weights diverging across different GPUs, and why does this issue become prominent when using low-precision formats for gradients?
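One way to probe this question is to check whether low-precision addition gives the same result regardless of the order of the operands. The sketch below is illustrative only: it emulates 16-bit rounding with Python's standard `struct` module, and the helper names (`to_fp16`, `fp16_sum`) and the gradient values are assumptions chosen for the demonstration, not part of the case study.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(values):
    """Sum left to right, rounding the accumulator to fp16 after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total

# Hypothetical gradient contributions from three nodes.
grads = [2048.0, 1.0, 1.0]

order_a = fp16_sum(grads)            # large value first
order_b = fp16_sum(reversed(grads))  # small values first

print(order_a, order_b)  # 2048.0 2050.0
```

Near 2048 the spacing between adjacent fp16 values is 2, so adding 1.0 to 2048.0 rounds back to 2048.0, while adding the two small values first produces 2.0, which survives. If replicas reduce the same gradients in different orders, they can apply slightly different updates, and the differences may compound over many steps.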

Updated 2025-10-05

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Impact of Floating-Point Non-Associativity in Gradient Accumulation

A team is training a large model using a distributed setup where each node computes gradients in a 16-bit floating-point format to save memory and improve speed. The main copy of the model's parameters is maintained in a more stable 32-bit floating-point format. Before updating these main parameters, the 16-bit gradients from all nodes are collected and summed together. Why is it standard practice to perform this summation into a 32-bit buffer before applying the final update?

Diagnosing Model Divergence in Distributed Training

A research team is training a large model on a distributed system using a low-precision floating-point format for gradient calculations. They run two identical experiments, with the only difference being how the gradients from different compute nodes are summed before the model update. In Experiment A, the gradients are always summed in a fixed, deterministic order. In Experiment B, the summation order varies unpredictably in each training step due to network latency. What is the most likely out
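The 32-bit summation buffer mentioned in the first related item above can be illustrated with a small experiment. This is a minimal sketch under stated assumptions: fp16 and fp32 rounding are emulated with the standard `struct` module, the helpers `round_to` and `accumulate` are hypothetical names, and the 10,000 identical gradients of about 0.001 are made-up values.

```python
import struct

FP16, FP32 = "<e", "<f"  # struct format codes for half and single precision

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the nearest value representable in fmt."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, fmt):
    """Sum left to right, rounding the running total to fmt after each add."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total

# Hypothetical data: many small gradients, each already stored in fp16.
grads = [round_to(FP16, 1e-3)] * 10_000

exact = sum(grads)                    # float64 reference value
fp16_total = accumulate(grads, FP16)  # 16-bit accumulator
fp32_total = accumulate(grads, FP32)  # 32-bit accumulator

print(f"reference: {exact:.4f}")
print(f"fp16 sum:  {fp16_total:.4f}")
print(f"fp32 sum:  {fp32_total:.4f}")
```

In the 16-bit accumulator, once each addend is smaller than half the spacing between adjacent representable values, further additions are rounded away and the total stops growing. The 32-bit accumulator has enough significand bits to keep absorbing the small contributions, so it stays close to the reference.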
