A team is training a large model in a distributed setup where each node computes gradients in a 16-bit floating-point format to save memory and improve speed. The master copy of the model's parameters is maintained in a higher-precision 32-bit floating-point format. Before these master parameters are updated, the 16-bit gradients from all nodes are collected and summed. Why is it standard practice to perform this summation into a 32-bit buffer before applying the final update?
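The core issue is that fp16 has so few mantissa bits that a large running sum silently swallows small addends. A minimal NumPy sketch (the gradient count and magnitude are illustrative assumptions, not values from the question) contrasts accumulating the same gradients in an fp16 buffer versus a 32-bit buffer:

```python
import numpy as np

# Illustrative setup: 10,000 small fp16 gradient contributions.
# (Count and magnitude are assumptions for this sketch.)
grads = np.full(10_000, 1e-4, dtype=np.float16)

# Naive reduction: accumulate directly in fp16. Once the running sum
# reaches ~0.25, each 1e-4 addend is smaller than half an fp16 ulp of
# the sum and is rounded away entirely, so the total stops growing.
acc16 = np.float16(0.0)
for g in grads:
    acc16 = np.float16(acc16 + g)

# Standard practice: accumulate into a 32-bit buffer before updating
# the 32-bit master parameters.
acc32 = np.float32(0.0)
for g in grads:
    acc32 += np.float32(g)

print(f"fp16 accumulation: {acc16:.6f}")  # ~0.25 (stalled)
print(f"fp32 accumulation: {acc32:.6f}")  # ~1.0 (expected total)
```

The same absorption effect is why mixed-precision recipes keep a 32-bit master copy of the weights in the first place: a tiny fp16 update added to a much larger parameter value would otherwise round to nothing.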
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Impact of Floating-Point Non-Associativity in Gradient Accumulation
Diagnosing Model Divergence in Distributed Training
A research team is training a large model on a distributed system, using a low-precision floating-point format for gradient calculations. They run two otherwise identical experiments that differ only in how the gradients from different compute nodes are summed before the model update. In Experiment A, the gradients are always summed in a fixed, deterministic order. In Experiment B, the summation order varies unpredictably at each training step due to network latency. What is the most likely outcome when comparing the final accumulated gradient values at each step between the two experiments?
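As a rough illustration of why the two experiments diverge, the following sketch (synthetic gradient values and a fixed seed, both assumptions for the example) reduces the same fp16 values in two different orders. Because floating-point addition is not associative, the two sums generally differ in their low-order bits:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely illustrative

# Synthetic per-node gradient contributions in fp16 (assumption for the sketch).
grads = rng.standard_normal(1024).astype(np.float16)

def reduce_in_order(values, order):
    """Left-to-right fp16 reduction of `values` in the given index order."""
    acc = np.float16(0.0)
    for i in order:
        acc = np.float16(acc + values[i])
    return acc

# Experiment A: fixed, deterministic summation order.
fixed = reduce_in_order(grads, range(len(grads)))

# Experiment B: order scrambled, as if set by network arrival times.
scrambled = reduce_in_order(grads, rng.permutation(len(grads)))

# The two sums are typically close but not bit-identical: floating-point
# addition is not associative, so reduction order changes the rounding.
print(fixed, scrambled, bool(fixed == scrambled))
```

The per-step discrepancies are small, but they make the two runs non-reproducible bit-for-bit, and over many updates the trajectories of the two experiments can drift apart.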