Distributed Gradient Calculation
In distributed training using data parallelism, the gradient of the loss function, $L$, with respect to the parameters, $\theta$, for a complete mini-batch, $\mathcal{D}$, is computed by summing the gradients from multiple workers. Each worker calculates the gradient on a separate partition of the mini-batch, denoted as $\mathcal{D}_j$. This aggregation of gradients is represented by the formula:

$$\frac{\partial L_{\mathcal{D}}(\theta)}{\partial \theta} = \sum_{j=1}^{k} \frac{\partial L_{\mathcal{D}_j}(\theta)}{\partial \theta}$$

where $k$ is the number of workers.
This allows for parallel computation, significantly speeding up the training process for large models.
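To make the aggregation concrete, here is a minimal NumPy sketch of the identity above; the four-worker split, the least-squares loss, and all variable names are illustrative assumptions rather than the text's own setup. Because a loss that sums over examples has a gradient that sums over partitions, adding up the workers' gradients reproduces the single-machine result exactly.

```python
import numpy as np

# Toy setup: linear model with a summed squared-error loss
# L_D(theta) = sum_i (x_i . theta - y_i)^2.  The model, batch size,
# and worker count are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # full mini-batch D of 32 examples
y = rng.normal(size=32)
theta = rng.normal(size=4)

def grad(X_part, y_part, theta):
    """Gradient of the summed squared error w.r.t. theta."""
    return 2.0 * X_part.T @ (X_part @ theta - y_part)

# Reference: gradient over the complete mini-batch D on one machine.
full_grad = grad(X, y, theta)

# Data parallelism: split D into k partitions D_j, one per worker.
k = 4
worker_grads = [
    grad(X_j, y_j, theta)
    for X_j, y_j in zip(np.array_split(X, k), np.array_split(y, k))
]

# Aggregation: the sum an all-reduce would perform across workers.
aggregated = np.sum(worker_grads, axis=0)

assert np.allclose(aggregated, full_grad)  # matches the single-machine gradient
```

In a real system the summation is carried out by a collective operation such as all-reduce, so that every worker holds the same aggregated gradient before applying the parameter update.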

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Collective Operation in Parallel Processing
Distributed Computation of Weighted Value Sums
Distributed Summation Scenario
Distributed Gradient Calculation
A computation such as summing all the elements of a massive vector involves data too large to fit on a single machine. The vector is therefore split into several smaller chunks, with each chunk processed on a separate computational node. Arrange the following steps to correctly describe how the final total sum is calculated in this distributed environment.
A dataset of numerical values is split across three computational nodes for processing. Node 1 is assigned the values [150, 200, 50]. Node 2 is assigned [300, 100]. Node 3 is assigned [250, 150, 100]. If the overall goal is to compute the total sum of all values using a distributed approach, what is the final result after the partial sums from each node are calculated and then aggregated?
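For reference, a minimal sketch of the pattern this scenario describes (the node names and data structure are assumed for illustration): each node reduces its local chunk to a partial sum, and the partial sums are then combined into the final total.

```python
# Each node reduces its local chunk to a partial sum...
node_chunks = {
    "node1": [150, 200, 50],    # partial sum 400
    "node2": [300, 100],        # partial sum 400
    "node3": [250, 150, 100],   # partial sum 500
}
partial_sums = {node: sum(chunk) for node, chunk in node_chunks.items()}

# ...then the partial sums are aggregated into the final total.
total = sum(partial_sums.values())
print(partial_sums)   # {'node1': 400, 'node2': 400, 'node3': 500}
print(total)          # 1300
```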
Gradient Descent Update Rule
Set of Distributed Data Batches in Data Parallelism
Ideal Speed-up in Data Parallelism
A team is training a neural network using a technique where a large batch of data is split equally among 8 machines. Each machine has a full, identical copy of the network model. During a training step, each machine processes its portion of the data and calculates a set of proposed parameter updates. Given this setup, what is the most critical subsequent action to ensure the entire system learns effectively from the full batch of data?
Distributed Gradient Calculation
A single training step is performed using a technique where a mini-batch of data is processed in parallel across multiple machines. Each machine holds a complete copy of the model. Arrange the following events in the correct chronological order for one such training step.
A machine learning team is training a large neural network on a massive dataset. To accelerate the process, they employ a strategy where the training data is split across 16 GPUs. Each GPU holds a complete copy of the model and processes its own subset of the data. After each forward and backward pass, the results from all GPUs are combined before updating the model's parameters. The team observes that while using 8 GPUs provided a nearly 8x speed-up compared to a single GPU, scaling to 16 GPUs only resulted in a 10x total speed-up. Based on the principles of the training strategy described, what is the most likely bottleneck causing this diminishing return in performance when scaling from 8 to 16 GPUs?
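One way to see how such a bottleneck produces sublinear scaling is with a toy cost model; every constant below is an assumption chosen only to mimic the shape of the observed numbers, not a measurement. The per-step compute time shrinks as 1/N, while the assumed synchronization cost grows with the number of GPUs, so communication eventually dominates.

```python
# Toy cost model: per-step compute shrinks as 1/N, while the assumed
# cost of synchronizing gradients grows with the number of GPUs
# (standing in for interconnect congestion).  Constants are made up.
def step_time(n_gpus, compute=1.0, comm_per_gpu=0.002):
    comm = comm_per_gpu * n_gpus if n_gpus > 1 else 0.0
    return compute / n_gpus + comm

for n in (1, 8, 16):
    print(f"{n:2d} GPUs -> speed-up {step_time(1) / step_time(n):.1f}x")
# 1 GPU -> 1.0x, 8 GPUs -> ~7.1x, 16 GPUs -> ~10.6x: scaling flattens
```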
Evaluating a Training Strategy
Your team must train a 30B-parameter LLM on a sing...
You are on-call for an internal LLM training platf...
Your team is training a 70B-parameter LLM on 8 GPU...
You’re advising an internal platform team that mus...
Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints
Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization
Diagnosing a Scaling Regression in Hybrid Parallel LLM Training
Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters
Choosing a Distributed Training Configuration After a Hardware Refresh
Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run
Distributed Gradient Calculation
An engineer is training a model using mini-batches and notices that while the overall training loss is decreasing over many updates, the loss value for individual mini-batches fluctuates significantly—sometimes increasing from one batch to the next. Which statement best analyzes the fundamental reason for this behavior based on the properties of the mini-batch loss gradient?
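A small numerical sketch of the underlying reason (the dataset, model, and batch size are illustrative assumptions): the gradient computed on a random mini-batch is an unbiased but noisy estimate of the full-dataset gradient, so individual estimates scatter around the true direction even though they agree on average, and single steps can transiently increase the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 4))                      # full dataset
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=1024)
theta = np.zeros(4)

def grad(X_b, y_b, theta):
    """Mean squared-error gradient over one batch."""
    return 2.0 * X_b.T @ (X_b @ theta - y_b) / len(y_b)

full = grad(X, y, theta)   # gradient over the entire dataset

# Gradients from random mini-batches scatter around the full gradient.
minis = np.array([
    grad(X[idx], y[idx], theta)
    for idx in (rng.choice(len(y), size=32, replace=False) for _ in range(100))
])

print("full-dataset gradient:   ", np.round(full, 2))
print("mean mini-batch gradient:", np.round(minis.mean(axis=0), 2))  # ~ full
print("per-batch std deviation: ", np.round(minis.std(axis=0), 2))   # the noise
```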
Analyzing Gradient Magnitude
Comparing Gradient Calculation Methods
Learn After
In a data-parallel distributed training setup with four workers, a mini-batch is split equally among them. For a particular training step, the gradient vectors calculated on three of the workers have a similar, small magnitude. However, the fourth worker calculates a gradient vector with a magnitude ten times larger than the others, possibly due to a corrupted data sample. According to the standard aggregation method for this setup, what is the most likely effect on the combined gradient used to update the model's parameters?
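A quick numerical illustration of the effect in question (all gradient values are assumed): because the standard aggregation simply sums the workers' gradients, a single outlier gradient of roughly ten times the usual magnitude can dominate, and even reverse, the combined update direction.

```python
import numpy as np

# Three workers report similar small gradients; the fourth reports one
# roughly 10x larger (values assumed purely for illustration).
worker_grads = [
    np.array([0.10, -0.20]),
    np.array([0.12, -0.18]),
    np.array([0.09, -0.21]),
    np.array([-1.00, 2.00]),   # outlier, ~10x the others' magnitude
]

combined = np.sum(worker_grads, axis=0)   # standard summed aggregation
print(combined)                           # [-0.69  1.41] -- outlier dominates
print(np.linalg.norm(combined))           # far larger than the healthy sum
print(np.linalg.norm(np.sum(worker_grads[:3], axis=0)))
```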
Aggregated Gradient Calculation