Learn Before
Aggregated Gradient Calculation
In a distributed training setup, a mini-batch of data is split between two workers. Each worker computes a gradient vector for a model's two parameters based on its portion of the data. Given the results below, what is the final aggregated gradient vector that will be used to update the model's parameters? Explain the simple mathematical operation used to combine the worker gradients.
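The worker results referenced by the question are not reproduced here, but the combining step can be sketched with hypothetical gradient values: in standard data-parallel training, the per-worker gradient vectors are combined element-wise, typically by averaging (equivalently, summing and dividing by the number of workers).

```python
# Hypothetical gradients for a two-parameter model, one vector per worker.
# These values are illustrative; the question's actual numbers are not shown here.
g_worker1 = [1.0, 2.0]
g_worker2 = [3.0, 4.0]

# Standard aggregation: element-wise average across workers.
aggregated = [(a + b) / 2 for a, b in zip(g_worker1, g_worker2)]
print(aggregated)  # [2.0, 3.0]
```

Each component of the aggregated vector is just the mean of the corresponding components from the workers, so the update behaves as if the full mini-batch had been processed on a single machine.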
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a data-parallel distributed training setup with four workers, a mini-batch is split equally among them. For a particular training step, the gradient vectors calculated on three of the workers have a similar, small magnitude. However, the fourth worker calculates a gradient vector with a magnitude ten times larger than the others, possibly due to a corrupted data sample. According to the standard aggregation method for this setup, what is the most likely effect on the combined gradient used to update the model's parameters?
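The effect of the outlier worker can be sketched numerically (the values below are hypothetical, chosen only to match the 10x-magnitude setup in the question): because the standard aggregation is an unweighted element-wise mean, a single large gradient pulls the combined gradient far away from what the three well-behaved workers computed.

```python
import numpy as np

# Hypothetical per-worker gradients for a two-parameter model.
normal = np.array([1.0, -1.0])   # three workers compute gradients of similar, small magnitude
outlier = 10 * normal            # fourth worker's gradient is 10x larger, e.g. from corrupt data
grads = [normal, normal, normal, outlier]

# Standard data-parallel aggregation: element-wise mean across the four workers.
combined = np.mean(grads, axis=0)
print(combined)  # [ 3.25 -3.25]: (1 + 1 + 1 + 10) / 4 = 3.25x a normal gradient
```

The single corrupted worker contributes as much as all three healthy workers combined and more, so the averaged gradient is dominated by it, which is why techniques such as gradient clipping are often applied before or after aggregation.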