Learn Before
Gradient Accumulation in Mixed Precision Training
A key operation in mixed precision training is gradient accumulation: summing and synchronizing the gradients from all distributed nodes before the model's parameters are updated. At scale, this step can introduce numerical challenges. Because floating-point addition is not associative, the accumulated gradient depends on the order in which contributions are summed, and these order-dependent differences can affect the model's convergence and final performance.
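A quick NumPy sketch makes the non-associativity concrete; the float16 values below are chosen purely so the rounding is visible at a glance:

```python
import numpy as np

# Illustrative float16 values; real gradient sums show the same
# effect, just less dramatically.
a = np.float16(2048.0)
b = np.float16(-2048.0)
c = np.float16(0.25)

print((a + b) + c)  # 0.25: a and b cancel exactly, then c is added
print(a + (b + c))  # 0.0:  -2048.0 + 0.25 rounds back to -2048.0 in float16
```

In a distributed all-reduce, the summation order depends on the reduction topology and message timing, so the same set of gradients can accumulate to slightly different values from one step to the next.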
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Gradient Accumulation in Mixed Precision Training
Low-Precision Arithmetic Challenges in Distributed Training
Optimizing Language Model Training Efficiency
A machine learning team is training a large model using a strategy that employs both 16-bit and 32-bit floating-point numbers. They observe that each training step is significantly faster and uses less memory, but the model's final performance is poor due to accumulating numerical errors that destabilize the training process. Which of the following is the most probable cause of this issue?
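One way to see the failure mode this question points at: a long sum of small, same-sign values kept entirely in float16 stalls once the running total grows, while the same sum in float32 does not. A minimal sketch with made-up values:

```python
import numpy as np

# 10,000 small same-sign contributions, as gradients often are.
grads = np.full(10_000, 1e-3, dtype=np.float16)

total16 = np.float16(0.0)
for g in grads:
    total16 = np.float16(total16 + g)  # running sum rounded to float16 each step

total32 = grads.astype(np.float32).sum()

print(total16)  # 4.0: the sum stalls once each addend is below half an ulp
print(total32)  # ~10.0, the expected total
```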
Rationale for Mixed Precision in Model Training
Your team must train a 30B-parameter LLM on a sing...
You are on-call for an internal LLM training platf...
Your team is training a 70B-parameter LLM on 8 GPU...
You’re advising an internal platform team that mus...
Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints
Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization
Diagnosing a Scaling Regression in Hybrid Parallel LLM Training
Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters
Choosing a Distributed Training Configuration After a Hardware Refresh
Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run
Learn After
Impact of Floating-Point Non-Associativity in Gradient Accumulation
A team is training a large model using a distributed setup where each node computes gradients in a 16-bit floating-point format to save memory and improve speed. The main copy of the model's parameters is maintained in a more stable 32-bit floating-point format. Before updating these main parameters, the 16-bit gradients from all nodes are collected and summed together. Why is it standard practice to perform this summation into a 32-bit buffer before applying the final update?
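A minimal sketch of that pattern, with NumPy standing in for the training framework and synthetic shapes, values, and learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
master_params = np.zeros(1_000, dtype=np.float32)    # fp32 master copy
# float16 gradients as they might arrive from 64 nodes (synthetic values).
node_grads = [rng.normal(0.0, 1e-3, 1_000).astype(np.float16)
              for _ in range(64)]

accum = np.zeros_like(master_params)                 # fp32 accumulation buffer
for g in node_grads:
    accum += g.astype(np.float32)                    # upcast each term before adding

lr = 1e-2
master_params -= lr * accum / len(node_grads)        # the update stays in fp32
```

Upcasting each term before the sum keeps small per-node contributions from being rounded away or swamped as the buffer grows, which is the usual rationale for the fp32 buffer.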
Diagnosing Model Divergence in Distributed Training
A research team is training a large model on a distributed system using a low-precision floating-point format for gradient calculations. They run two identical experiments, with the only difference being how the gradients from different compute nodes are summed before the model update. In Experiment A, the gradients are always summed in a fixed, deterministic order. In Experiment B, the summation order varies unpredictably in each training step due to network latency. What is the most likely outcome when comparing the final accumulated gradient values at each step between the two experiments?
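The order-dependence the two experiments probe can be reproduced directly; a NumPy sketch with synthetic per-node contributions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-node contributions to one parameter's gradient, in float16.
contribs = rng.normal(0.0, 1.0, 4096).astype(np.float16)

def fp16_ordered_sum(values):
    total = np.float16(0.0)
    for v in values:
        total = np.float16(total + v)  # every partial sum rounds to float16
    return total

fixed = fp16_ordered_sum(contribs)                    # Experiment A: fixed order
varied = fp16_ordered_sum(rng.permutation(contribs))  # Experiment B: order varies
print(fixed, varied)  # typically close but not bit-identical
```

The two totals are usually close but not bit-identical, a reproducibility concern rather than a systematic bias.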