Learn Before
  • Data Parallelism

Gradient Descent Update Rule

The standard delta rule for gradient descent updates a model's parameters by moving them in the direction opposite to the gradient of the loss function. The update is performed according to the formula

$$\theta_{t+1} = \theta_t - lr \cdot \frac{\partial L_{\theta_t}(D_{\text{mini}})}{\partial \theta_t}$$

In this equation, $\theta_{t+1}$ are the updated parameters, $\theta_t$ are the parameters at the current step, $lr$ is the learning rate, and the fractional term is the gradient of the loss function $L$ with respect to the parameters $\theta_t$, computed on a mini-batch of data $D_{\text{mini}}$.
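As a minimal sketch (not from the source), the update rule can be written in a few lines of Python; the function name `sgd_update` and the numeric values below are hypothetical, chosen only to illustrate one step of the update.

```python
import numpy as np

def sgd_update(theta, grad, lr):
    """Apply one gradient descent step: theta_{t+1} = theta_t - lr * grad.

    `grad` is assumed to be the gradient of the loss L on a mini-batch
    D_mini with respect to the current parameters theta_t.
    """
    return theta - lr * grad

# Hypothetical values for illustration only.
theta_t = np.array([0.5, -1.2])          # current parameters theta_t
grad_minibatch = np.array([0.1, -0.3])   # dL_{theta_t}(D_mini) / d theta_t
theta_next = sgd_update(theta_t, grad_minibatch, lr=0.1)
print(theta_next)  # [ 0.49 -1.17]
```

In data parallelism, `grad_minibatch` would be the aggregated gradient gathered from the workers before this update is applied.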

Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Gradient Descent Update Rule

  • Set of Distributed Data Batches in Data Parallelism

  • Gradient Aggregation in Data Parallelism

  • Ideal Speed-up in Data Parallelism