Gradient Accumulation in Mixed Precision Training

A key operation in mixed precision training is gradient accumulation: summing and synchronizing the gradients from all distributed nodes before the model's parameters are updated. This process can introduce numerical challenges, particularly at scale. Because floating-point addition is not associative, each addition rounds its result to the nearest representable value, so different reduction orders can produce slightly different accumulated gradients. In low-precision formats such as FP16 or BF16 these discrepancies are larger, and the resulting inconsistencies can affect the model's convergence and final performance.
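
As a rough illustration (not tied to any particular framework), the NumPy sketch below shows how the order of low-precision accumulation changes the result, and how keeping the accumulator in FP32 makes the sum far less sensitive to ordering. The shapes, seed, and simulated per-node gradients are hypothetical.

```python
import numpy as np

# Hypothetical per-node gradient shards (64 "nodes", 4096 parameters each),
# stored in half precision as they might be during mixed precision training.
rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 4096)).astype(np.float16)

def accumulate(shards, dtype):
    """Sum gradient shards one node at a time, rounding in the given precision."""
    total = np.zeros(shards.shape[1], dtype=dtype)
    for g in shards:
        total = (total + g.astype(dtype)).astype(dtype)
    return total

# FP16 accumulation: the result depends on the order in which nodes are reduced,
# because every intermediate sum is rounded to the nearest FP16 value.
fp16_forward = accumulate(grads, np.float16)
fp16_reverse = accumulate(grads[::-1], np.float16)

# FP32 accumulation: the two orders agree to far higher precision.
fp32_forward = accumulate(grads, np.float32)
fp32_reverse = accumulate(grads[::-1], np.float32)

print("max |fp16 forward - fp16 reverse|:", np.abs(fp16_forward - fp16_reverse).max())
print("max |fp32 forward - fp32 reverse|:", np.abs(fp32_forward - fp32_reverse).max())
print("max |fp16 - fp32| accumulation gap:",
      np.abs(fp16_forward.astype(np.float32) - fp32_forward).max())
```

This is why mixed precision recipes typically accumulate and reduce gradients in FP32 (alongside an FP32 master copy of the weights), and why some systems additionally fix a deterministic reduction order so the synchronized gradient does not drift with the shape of the reduction tree.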
