Formula

Distributed Gradient Calculation

In distributed training using data parallelism, the gradient of the loss function, $L$, with respect to the parameters, $\theta_t$, for a complete mini-batch, $\mathcal{D}_{\mathrm{mini}}$, is computed by summing the gradients from multiple workers. Each worker calculates the gradient on a separate partition of the mini-batch, denoted $\mathcal{D}^1, \mathcal{D}^2, \dots, \mathcal{D}^N$. This aggregation of gradients is represented by the formula:

$$\frac{\partial L_{\theta_t}(\mathcal{D}_{\mathrm{mini}})}{\partial \theta_t} = \underbrace{\frac{\partial L_{\theta_t}(\mathcal{D}^1)}{\partial \theta_t}}_{\textrm{worker 1}} + \underbrace{\frac{\partial L_{\theta_t}(\mathcal{D}^2)}{\partial \theta_t}}_{\textrm{worker 2}} + \cdots + \underbrace{\frac{\partial L_{\theta_t}(\mathcal{D}^N)}{\partial \theta_t}}_{\textrm{worker } N}$$

Because each worker only needs its own partition of the mini-batch, the gradient computation proceeds in parallel, significantly speeding up training for large models.
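As a sanity check of this identity, the NumPy sketch below uses a hypothetical toy linear model (not from the original text): it splits a mini-batch across $N$ simulated workers, sums the per-partition gradients, and confirms the result matches the gradient computed on the full batch. The equality holds exactly because the loss here is a sum over examples; with a mean-reduced loss, each worker's contribution would need to be reweighted by its partition size.

```python
import numpy as np

# Toy linear model y = x . theta with a squared-error loss *summed* over examples.
def grad(theta, X, y):
    # d/dtheta of sum_i (x_i . theta - y_i)^2  =  2 * X^T (X theta - y)
    return 2.0 * X.T @ (X @ theta - y)

rng = np.random.default_rng(0)

# A full mini-batch D_mini of 8 examples with 3 features, and current parameters theta_t.
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
theta_t = rng.normal(size=3)

# Gradient computed on the whole mini-batch by a single worker.
g_full = grad(theta_t, X, y)

# Split the mini-batch into N partitions (one per simulated worker) and sum the gradients.
N = 4
g_sum = sum(
    grad(theta_t, X_k, y_k)
    for X_k, y_k in zip(np.array_split(X, N), np.array_split(y, N))
)

# The aggregated gradient equals the full mini-batch gradient.
assert np.allclose(g_full, g_sum)
print(g_full)
print(g_sum)
```

In a real data-parallel setup, each worker would compute its partial gradient on its own device and the summation would be performed by a collective communication step (e.g. an all-reduce) rather than in a single process as in this sketch.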
