Multiple Choice

In a data-parallel distributed training setup with four workers, a mini-batch is split equally among them. For a particular training step, the gradient vectors calculated on three of the workers have a similar, small magnitude. However, the fourth worker calculates a gradient vector with a magnitude ten times larger than the others, possibly due to a corrupted data sample. According to the standard aggregation method for this setup, what is the most likely effect on the combined gradient used to update the model's parameters?
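
The question hinges on how synchronous data parallelism combines gradients: the per-worker gradients are summed (e.g. via all-reduce) and divided by the number of workers, so every worker contributes equally to the average. The sketch below uses hypothetical gradient values (not taken from the question) to illustrate how a single large-magnitude gradient pulls that average.

```python
import numpy as np

# Hypothetical per-worker gradients: three workers produce similar small
# gradients, the fourth produces one with roughly 10x the magnitude.
worker_grads = [
    np.array([0.02, -0.01, 0.03]),
    np.array([0.01, -0.02, 0.02]),
    np.array([0.02, -0.01, 0.01]),
    np.array([0.25, -0.20, 0.30]),  # outlier, e.g. from a corrupted sample
]

# Standard data-parallel aggregation: average the per-worker gradients
# (sum across workers, then divide by the worker count).
combined = np.mean(worker_grads, axis=0)

print("per-worker norms:", [round(float(np.linalg.norm(g)), 3) for g in worker_grads])
print("combined gradient:", combined)
print("combined norm:", round(float(np.linalg.norm(combined)), 3))

# Because the outlier contributes a full 1/4 of the average, the combined
# gradient is pulled noticeably toward it, and its norm is much larger than
# an average over the three "clean" workers alone.
```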

Updated 2025-10-01

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science