Pipeline Parallelism
Pipeline parallelism is a strategy designed to overcome the inefficiency of basic model parallelism, where hardware is underutilized because only one device is active at any given moment. The technique introduces computational overlap by dividing each data batch into smaller units called micro-batches. These micro-batches are fed into a pipeline of workers, so a worker can begin processing the next micro-batch as soon as it has finished the current one and handed its output to the next stage. This creates a continuous flow of computation in which different devices work simultaneously on different micro-batches at different stages of the model.
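To make the overlap concrete, here is a minimal sketch (plain Python, framework-free; the name pipeline_schedule and the stage/micro-batch counts are illustrative, not taken from any particular library) that simulates only the forward-pass timing of a micro-batched pipeline: stage s works on micro-batch m at time step s + m, so after a short fill phase every device is busy with a different micro-batch.

# Minimal sketch of a GPipe-style forward schedule with micro-batches.
# Stage s processes micro-batch m at time step t = s + m, so different
# devices (stages) are busy with different micro-batches at the same step.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    """Return a list of time steps; each step maps stage -> micro-batch id."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        step = {}
        for stage in range(num_stages):
            mb = t - stage                  # micro-batch this stage holds at step t
            if 0 <= mb < num_microbatches:
                step[stage] = mb            # stage is busy with micro-batch mb
            # otherwise the stage is idle (the pipeline "bubble" at start/end)
        schedule.append(step)
    return schedule

if __name__ == "__main__":
    # 4 stages, 8 micro-batches: after a short fill phase, all 4 stages
    # compute simultaneously on different micro-batches.
    for t, step in enumerate(pipeline_schedule(num_stages=4, num_microbatches=8)):
        busy = ", ".join(f"stage {s}: mb {m}" for s, m in sorted(step.items()))
        print(f"t={t:2d}  {busy}")

Running the sketch for 4 stages and 8 micro-batches shows that only the first and last few steps contain idle stages; in the steady state all four stages compute at once, which is exactly the utilization gain pipeline parallelism provides over processing one full batch stage by stage.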
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Data Parallelism
Model Parallelism
Pipeline Parallelism
A research team is developing a novel language model with several trillion parameters. During the initial training setup, they discover that the model is too large to fit into the memory of a single available accelerator (e.g., a GPU). Which parallelism strategy is specifically designed to address this fundamental constraint?
Match each parallelism strategy with the description that best defines its core mechanism for distributing the training workload.
Diagnosing Training Inefficiency
Layer-wise Model Parallelism
Combining Model Parallelism with Other Mechanisms
Tensor Parallelism
Pipeline Parallelism
A research team is training a neural network that is too large to fit into the memory of a single processing unit. To overcome this limitation, they decide to split the network's layers, placing the first set of layers on the first unit, the next set on the second unit, and so on, with the data flowing through them in sequence. Which statement best analyzes how this strategy addresses the memory constraint?
Choosing a Parallelism Strategy for a Large Model
Rationale for Model Partitioning
Your team must train a 30B-parameter LLM on a sing...
You are on-call for an internal LLM training platf...
Your team is training a 70B-parameter LLM on 8 GPU...
You’re advising an internal platform team that mus...
Designing a Distributed Training Plan Under Memory, Throughput, and Stability Constraints
Postmortem and Redesign of a Distributed LLM Training Run with Divergence and Low GPU Utilization
Diagnosing a Scaling Regression in Hybrid Parallel LLM Training
Stabilizing and Scaling an LLM Training Job Across Two GPU Clusters
Choosing a Distributed Training Configuration After a Hardware Refresh
Selecting a Hybrid Parallelism + Mixed-Precision Strategy for a Memory-Bound LLM Training Run
Learn After
Micro-batching in Pipeline Parallelism
Illustration of Pipeline Parallelism with Micro-batches
A large neural network model is partitioned across four sequential processing stages, with each stage running on a separate hardware device. During training, a full batch of data is processed entirely by the first device, and its output is then passed to the second device. The second device processes this output and passes its result to the third, and so on. While one device is actively computing, the other three devices are idle, waiting for their turn. What is the primary inefficiency this specific computational strategy introduces?
A large computational model is partitioned across two hardware devices (Device 1 and Device 2) in a sequential pipeline. To improve efficiency, a data batch is divided into two smaller micro-batches. Arrange the following events in the correct chronological order to accurately represent the flow of computation that maximizes hardware utilization.
Optimizing Training Efficiency for a Large Model