Learn Before
Loss Gradient over a Mini-batch
The expression $\nabla_{\theta} L_{\mathcal{B}}(\theta)$ represents the gradient of the loss function, $L$, with respect to the model parameters, $\theta$. This gradient is computed on a specific mini-batch of training samples, $\mathcal{B}$, and indicates the direction of the steepest increase in the loss for that batch.
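To make this concrete, here is a minimal NumPy sketch of computing the loss gradient over one mini-batch and comparing it with the full-batch gradient. The linear model, squared-error loss, and batch size of 32 are illustrative assumptions, not part of the card:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m samples from a hypothetical linear model y = X @ theta.
m, d = 1000, 5
X = rng.normal(size=(m, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=m)

theta = np.zeros(d)  # current model parameters

def minibatch_gradient(theta, X_batch, y_batch):
    """Gradient of the mean squared-error loss over one mini-batch B:

    L(theta; B)     = (1/|B|) * sum_i (x_i @ theta - y_i)^2
    dL/dtheta       = (2/|B|) * X_B^T (X_B @ theta - y_B)
    """
    residual = X_batch @ theta - y_batch
    return (2.0 / len(y_batch)) * (X_batch.T @ residual)

# Sample a mini-batch B of 32 examples and compute its loss gradient.
idx = rng.choice(m, size=32, replace=False)
g_batch = minibatch_gradient(theta, X[idx], y[idx])

# Full-batch gradient, for comparison: the mini-batch gradient is a
# noisy estimate of this direction of steepest increase in the loss.
g_full = minibatch_gradient(theta, X, y)
print("mini-batch grad:", np.round(g_batch, 3))
print("full-batch grad:", np.round(g_full, 3))
```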

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
An Example of Mini-Batches
Mini-Batch Gradient Descent Algorithm
Batch vs Stochastic vs Mini-Batch Gradient Descent
Example Using Mini-Batch Gradient Descent (Learning Rate Decay)
Mini-Batches Size
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:
Stochastic Gradient Descent Algorithm
Learn After
Distributed Gradient Calculation
An engineer is training a model using mini-batches and notices that while the overall training loss is decreasing over many updates, the loss value for individual mini-batches fluctuates significantly, sometimes increasing from one batch to the next (a sketch after this list reproduces the effect). Which statement best analyzes the fundamental reason for this behavior based on the properties of the mini-batch loss gradient?
Analyzing Gradient Magnitude
Comparing Gradient Calculation Methods
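The engineer's observation in the question above follows from a general property: each mini-batch gradient (and loss) is a noisy estimate of the full-data quantity, computed on a different random subset each step, so the per-batch loss can rise even as the overall loss trends down. A minimal NumPy sketch that reproduces the effect; the linear model, learning rate, and batch size are illustrative assumptions, not from the course:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear-regression setup, as in the earlier sketch.
m, d = 1000, 5
X = rng.normal(size=(m, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.5 * rng.normal(size=m)

theta = np.zeros(d)
lr, batch_size = 0.05, 32

def batch_loss(theta, Xb, yb):
    """Mean squared-error loss on a given set of samples."""
    return np.mean((Xb @ theta - yb) ** 2)

for step in range(1, 201):
    # Each update uses the loss gradient over a fresh random mini-batch.
    idx = rng.choice(m, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * (Xb.T @ (Xb @ theta - yb))
    theta -= lr * grad
    if step % 40 == 0:
        # The per-batch loss fluctuates from step to step, while the
        # full-data loss decreases steadily: each mini-batch is only a
        # noisy sample of the overall training objective.
        print(f"step {step:3d}  batch loss {batch_loss(theta, Xb, yb):.3f}  "
              f"full loss {batch_loss(theta, X, y):.3f}")
```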