Concept

Impact of Floating-Point Non-Associativity in Gradient Accumulation

A significant numerical issue in distributed training arises from the non-associativity of floating-point addition: because every intermediate sum is rounded, (a + b) + c is not guaranteed to equal a + (b + c). During gradient accumulation, where gradients are summed across multiple nodes, the order in which the terms are added can therefore change the result, so the final accumulated values may differ slightly from node to node. These discrepancies are small, but they can negatively affect the model's convergence and its final performance.
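The effect can be reproduced on a single machine with ordinary Python floats (IEEE 754 double precision). The sketch below is only an illustration: the helper `sequential_sum` and the values in `grads` are hypothetical stand-ins for per-node gradient contributions, not part of any training framework.

```python
def sequential_sum(values):
    """Add the values left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

# Grouping alone changes the result for three ordinary decimals.
left = (0.1 + 0.2) + 0.3    # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)   # 0.6

# Hypothetical gradient contributions from three nodes (assumed values).
# The small term is lost if it is added while the running sum is huge.
grads = [1e17, 1.0, -1e17]
order_a = sequential_sum(grads)                           # (1e17 + 1.0) - 1e17
order_b = sequential_sum([grads[0], grads[2], grads[1]])  # (1e17 - 1e17) + 1.0

print(left == right)     # False
print(order_a, order_b)  # 0.0 1.0
```

Both orderings add the same three numbers, yet one returns 0.0 and the other 1.0. Real gradient values are rarely this extreme, so the differences in practice are usually far smaller, but the mechanism is the same: the order in which the terms are reduced determines which rounding errors occur.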

Updated 2026-05-02

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
