Learn Before
Data Parallelism
Gradient Descent Update Rule
The standard delta rule for gradient descent updates a model's parameters by moving them in the direction opposite to the gradient of the loss function. The update is performed according to the formula: $\theta_{t+1} = \theta_t - \eta \cdot \frac{\partial L(\theta_t; \mathcal{B})}{\partial \theta_t}$. In this equation, $\theta_{t+1}$ are the updated parameters, $\theta_t$ are the parameters at the current step, $\eta$ is the learning rate, and the fractional term represents the gradient of the loss function $L$ with respect to the parameters $\theta_t$, computed on a mini-batch of data $\mathcal{B}$.
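As a concrete illustration, here is a minimal sketch of the update rule in Python, assuming a hypothetical one-parameter linear model with mean-squared-error loss; the model, data, and names (`x`, `y`, `theta`, `lr`) are illustrative choices, not fixed by the card.

```python
import numpy as np

# Hypothetical mini-batch B: 32 inputs x and targets y for y ≈ 3x.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
y = 3.0 * x + rng.normal(scale=0.1, size=32)

theta = 0.0  # current parameters theta_t (a single scalar here)
lr = 0.1     # learning rate eta

for step in range(100):
    # Gradient of L(theta) = mean((theta*x - y)^2) with respect to
    # theta, computed on the mini-batch: mean(2 * (theta*x - y) * x).
    grad = np.mean(2.0 * (theta * x - y) * x)
    # Delta rule: move theta opposite to the gradient, scaled by lr.
    theta = theta - lr * grad

print(f"learned theta: {theta:.3f}")  # approaches 3.0
```

In a data-parallel setting, each worker would compute `grad` on its own shard of the mini-batch, and the per-worker gradients would be aggregated before applying this same update.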

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Set of Distributed Data Batches in Data Parallelism
Gradient Aggregation in Data Parallelism
Ideal Speed-up in Data Parallelism