Learn Before
In a data-parallel distributed training setup with four workers, a mini-batch is split equally among them. For a particular training step, the gradient vectors computed on three of the workers have similarly small magnitudes. However, the fourth worker computes a gradient vector whose magnitude is ten times larger than the others, possibly due to a corrupted data sample. According to the standard aggregation method for this setup, what is the most likely effect on the combined gradient used to update the model's parameters?
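For intuition, here is a minimal numeric sketch (not part of the original card) assuming the standard aggregation is an element-wise mean of the workers' gradients, i.e. an all-reduce sum divided by the worker count; the gradient values are made up for illustration:

```python
# Minimal sketch: why one large-magnitude gradient dominates the
# standard data-parallel average. All values are hypothetical.
import numpy as np

# Hypothetical 3-dimensional gradients from four workers.
g_small = np.array([0.1, -0.2, 0.15])   # typical worker gradient
g_outlier = 10.0 * g_small              # worker with a corrupted sample

worker_grads = [g_small, g_small, g_small, g_outlier]

# Standard aggregation: element-wise mean across workers
# (all-reduce sum followed by division by the number of workers).
combined = np.mean(worker_grads, axis=0)

print(np.linalg.norm(g_small))   # ~0.269, magnitude of a typical gradient
print(np.linalg.norm(combined))  # ~0.875, mean is ~3.25x a typical gradient
```

Since the outlier is ten times larger, it contributes 10/13 (about 77%) of the averaged gradient, so the combined update is skewed toward the corrupted worker's gradient in both magnitude and direction rather than reflecting the three typical workers.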
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Aggregated Gradient Calculation