Learn Before
Comparing Gradient Calculation Methods
Consider two scenarios for updating a model's parameters: one using the gradient calculated from a single, small subset of the training data, and the other using the gradient calculated from the entire training dataset. Explain the fundamental difference in the information provided by these two gradients and justify why, despite this difference, using the gradient from the small subset is a standard and effective practice in training large models.
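One way to explore the two scenarios numerically is the minimal sketch below. It assumes a hypothetical linear-regression problem with a mean squared error loss; the data, the sizes, and the helper `mse_gradient` are illustrative choices and are not taken from the course material. It computes the gradient on the full dataset and on each of 20 equally sized mini-batches at the same parameter values, so the two can be compared directly.

```python
import numpy as np

# Hypothetical setup: synthetic linear-regression data (an assumption for illustration).
rng = np.random.default_rng(0)
n, d, batch_size = 1000, 5, 50
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # current parameters at which both gradients are evaluated


def mse_gradient(X_part, y_part, w):
    """Gradient of the mean squared error over the given examples."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)


# Scenario 2: gradient from the entire training dataset.
full_grad = mse_gradient(X, y, w)

# Scenario 1: gradients from small subsets (a shuffled partition into mini-batches).
perm = rng.permutation(n)
batch_grads = np.stack([
    mse_gradient(X[idx], y[idx], w)
    for idx in perm.reshape(-1, batch_size)
])

# Averaging the mini-batch gradients over the whole partition recovers the full gradient,
# while any single mini-batch gradient deviates from it.
mean_batch_grad = batch_grads.mean(axis=0)
max_single_batch_error = np.abs(batch_grads - full_grad).max()

print("full gradient:            ", full_grad)
print("mean of batch gradients:  ", mean_batch_grad)
print("largest single-batch gap: ", max_single_batch_error)
```

In this sketch each mini-batch gradient touches 50 examples instead of 1000, which is where the cost saving comes from; the deviation of individual batches from the full gradient is the noise the question asks about.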
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Distributed Gradient Calculation
An engineer is training a model using mini-batches and notices that while the overall training loss is decreasing over many updates, the loss value for individual mini-batches fluctuates significantly—sometimes increasing from one batch to the next. Which statement best analyzes the fundamental reason for this behavior based on the properties of the mini-batch loss gradient?
Analyzing Gradient Magnitude
Comparing Gradient Calculation Methods