Learn Before
In a data-parallel distributed training setup with four workers, a mini-batch is split equally among them. For a particular training step, the gradient vectors computed on three of the workers have similarly small magnitudes. However, the fourth worker computes a gradient vector whose magnitude is ten times larger than the others, possibly due to a corrupted data sample. According to the standard aggregation method for this setup, what is the most likely effect on the combined gradient used to update the model's parameters?
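For intuition, here is a minimal numeric sketch (not part of the original card) assuming the standard aggregation is an element-wise mean of the workers' gradients, i.e. an all-reduce sum divided by the worker count; the gradient values are made up for illustration:

```python
# Minimal sketch: why one large-magnitude gradient dominates the
# standard data-parallel average. All values are hypothetical.
import numpy as np

# Hypothetical 3-dimensional gradients from four workers.
g_small = np.array([0.1, -0.2, 0.15])   # typical worker gradient
g_outlier = 10.0 * g_small              # worker with a corrupted sample

worker_grads = [g_small, g_small, g_small, g_outlier]

# Standard aggregation: element-wise mean across workers
# (all-reduce sum followed by division by the number of workers).
combined = np.mean(worker_grads, axis=0)

print(np.linalg.norm(g_small))   # ~0.269, magnitude of a typical gradient
print(np.linalg.norm(combined))  # ~0.875, mean is ~3.25x a typical gradient
```

Since the outlier is ten times larger, it contributes 10/13 (about 77%) of the averaged gradient, so the combined update is skewed toward the corrupted worker's gradient in both magnitude and direction rather than reflecting the three typical workers.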
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Aggregated Gradient Calculation