Multiple Choice

A research team is training a large model on a distributed system using a low-precision floating-point format for gradient calculations. They run two identical experiments, with the only difference being how the gradients from different compute nodes are summed before the model update. In Experiment A, the gradients are always summed in a fixed, deterministic order. In Experiment B, the summation order varies unpredictably in each training step due to network latency. What is the most likely outcome when comparing the final accumulated gradient values at each step between the two experiments?
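The question hinges on the fact that floating-point addition is not associative: each partial sum is rounded, so summing the same values in a different order can produce a slightly different result. A minimal Python sketch (standard double precision, but the same effect is amplified in low-precision formats):

```python
# Floating-point addition is not associative: each intermediate sum
# is rounded, so grouping the same operands differently can change
# the final result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one fixed summation order (Experiment A)
right_to_left = a + (b + c)   # a different order (Experiment B)

# The two results differ in the last bits even though the operands
# are identical.
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)
```

In a distributed setting, nondeterministic reduction order across nodes has the same effect at every step, so Experiment B's accumulated gradients will generally not be bitwise identical to Experiment A's, and the runs can slowly diverge.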

Updated 2025-10-10

Tags: Ch.2 Generative Models - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science