
Data Parallelism

Data parallelism is one of the most prevalent methods for training neural networks. The technique divides a mini-batch of data among multiple workers, each of which holds a full replica of the model. The workers process their assigned portions in parallel to compute local loss gradients, which are then combined into the gradient for the entire mini-batch and used to update the model's parameters. In its simplest form, the process can be illustrated with the standard delta rule of gradient descent. When communication overhead is small, this approach can ideally speed up training by a factor close to the number of workers.
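Concretely, one data-parallel step might look like the following NumPy sketch, which simulates the workers sequentially. The linear model, the squared-error loss, and names such as `num_workers` and `lr` are illustrative assumptions; a real system would run the workers concurrently and combine their local gradients with a collective operation such as all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers = 4   # assumed worker count for illustration
lr = 0.1          # assumed learning rate (delta-rule step size)

# Toy linear model y = X @ w with a squared-error loss (illustrative).
X = rng.normal(size=(32, 8))   # one mini-batch of 32 examples
y = rng.normal(size=(32,))
w = np.zeros(8)                # every worker holds a full replica of w

# 1) Split the mini-batch evenly among the workers.
shards = np.array_split(np.arange(len(X)), num_workers)

# 2) Each worker computes the gradient of the loss on its own shard.
local_grads = []
for idx in shards:
    Xi, yi = X[idx], y[idx]
    err = Xi @ w - yi                      # local predictions minus targets
    local_grads.append(Xi.T @ err / len(idx))

# 3) Combine (average) the local gradients into the mini-batch gradient.
grad = np.mean(local_grads, axis=0)

# 4) Delta-rule update applied identically on every replica,
#    keeping all model copies in sync for the next step.
w -= lr * grad
```

Because the shards are equal-sized, averaging the local gradients reproduces exactly the gradient that a single worker would have computed on the whole mini-batch, so the parallel step is mathematically equivalent to the sequential one.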

