Impact of Floating-Point Non-Associativity in Gradient Accumulation
A significant numerical issue in distributed training arises from the non-associativity of floating-point addition: because every intermediate sum is rounded to the nearest representable value, (a + b) + c need not equal a + (b + c). During gradient accumulation, where gradients from multiple nodes are summed, the order in which contributions arrive can therefore change the final accumulated value slightly. These discrepancies are small, but they make runs non-reproducible bit-for-bit and can affect the model's convergence and final performance.
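As a minimal, self-contained illustration (NumPy and float16 here are stand-ins for a real training stack, not taken from any particular framework), the same set of gradient values can sum to different totals depending on accumulation order:

```python
import numpy as np

# Illustrative stand-in for per-node gradient values, in a 16-bit format.
rng = np.random.default_rng(0)
grads = rng.standard_normal(10_000).astype(np.float16)

def accumulate(values, dtype):
    # Sum sequentially, rounding each partial sum to `dtype`,
    # as a running reduction over workers arriving one by one would.
    total = dtype(0.0)
    for v in values:
        total = dtype(total + v)
    return total

# The same values, summed in two different arrival orders:
print(accumulate(grads, np.float16))        # one order
print(accumulate(grads[::-1], np.float16))  # reversed order: typically a different total

# In a wider format the order-dependence shrinks dramatically:
print(accumulate(grads, np.float32) - accumulate(grads[::-1], np.float32))
```

Because each low-precision partial sum is rounded before the next addition, reordering the terms changes which rounding errors occur; widening the accumulator is the standard way to shrink (though not fully eliminate) this effect.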
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Low-Precision Arithmetic Challenges in Distributed Training
Diagnosing Instability in Large-Scale Model Training
A team training a large model on a distributed system notices a peculiar issue. When they perform a gradient accumulation step by summing gradients from all worker nodes, the final aggregated gradient on the parameter server differs slightly depending on the order in which the gradients arrive and are summed. The team has verified that all worker nodes use identical hardware and compute their individual gradients correctly. Which of the following best explains this phenomenon?
Investigating Training Instability with Mixed-Precision Hardware
A team is training a large model using a distributed setup where each node computes gradients in a 16-bit floating-point format to save memory and improve speed. The main copy of the model's parameters is maintained in a more stable 32-bit floating-point format. Before these main parameters are updated, the 16-bit gradients from all nodes are collected and summed. Why is it standard practice to accumulate this sum in a 32-bit buffer before applying the final update? (A minimal sketch of this pattern appears just after this list.)
Diagnosing Model Divergence in Distributed Training
A research team is training a large model on a distributed system using a low-precision floating-point format for gradient calculations. They run two identical experiments, with the only difference being how the gradients from different compute nodes are summed before the model update. In Experiment A, the gradients are always summed in a fixed, deterministic order. In Experiment B, the summation order varies unpredictably in each training step due to network latency. What is the most likely outcome when comparing the final accumulated gradient values at each step between the two experiments?
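For the mixed-precision scenario above (16-bit per-node gradients, 32-bit main parameters), here is a hedged sketch of the accumulation pattern that question describes; the node count, tensor shape, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 16 workers, each holding fp16 gradients for one parameter tensor.
node_grads = [rng.standard_normal(4_096).astype(np.float16) for _ in range(16)]

# Accumulate in a 32-bit buffer: each fp16 gradient is widened before the add,
# so rounding error does not compound in the running sum.
accum = np.zeros(4_096, dtype=np.float32)
for g in node_grads:
    accum += g.astype(np.float32)

# The 32-bit main parameters are then updated from the 32-bit accumulation.
main_params = np.zeros(4_096, dtype=np.float32)  # stand-in for the real parameters
lr = 1e-3                                        # illustrative learning rate
main_params -= lr * (accum / len(node_grads))
```

The buffer costs one fp32 copy of the gradient tensor, but only the individual contributions, never the running total, are limited to fp16 precision.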
Learn After
An engineering team is conducting two parallel, distributed training runs of the same large model. Both runs use identical hardware, software, datasets, and initial parameters. The only difference is that Run A uses 8 compute nodes and Run B uses 16 compute nodes. After several hundred steps, the team observes that the model weights in Run A and Run B, while very similar, are not bit-for-bit identical. Which of the following is the most fundamental and likely cause for this divergence?
Diagnosing Training Reproducibility Issues
Evaluating the Practical Impact of Floating-Point Non-Associativity