Example of Model Parallelism with a Transformer Decoder

Layer-wise model parallelism can be applied to a Transformer decoder composed of $L$ stacked blocks. To distribute the computational load, each block is assigned to a different worker. During a single run of the model, the forward pass ($\uparrow$) processes the input sequentially from the lowest layer (Worker 1) up to the highest layer (Worker $L$). Subsequently, the backward pass ($\downarrow$) propagates error gradients in the reverse order, moving from Worker $L$ back down to Worker 1. This creates a sequential execution flow across the workers, which can be visualized as follows:

Worker L                          B_L (↑)  B_L (↓)
  ...                        ...                ...
Worker 2          B_2 (↑)                          B_2 (↓)
Worker 1   B_1 (↑)                                    B_1 (↓)
           ───────────────────── time ─────────────────────→
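
As a minimal sketch of this scheme, the PyTorch snippet below pins two decoder blocks to separate devices (a hypothetical two-worker setup; it falls back to a single CPU so it stays runnable without two GPUs). The forward pass visits the workers bottom-up, and autograd retraces the same chain top-down during the backward pass. The toy loss and tensor shapes are illustrative assumptions, not part of the original example.

```python
import torch
import torch.nn as nn

# Hypothetical two-worker setup: each decoder block lives on its own device.
# Falls back to CPU-only so the sketch runs even without two GPUs.
two_gpus = torch.cuda.device_count() >= 2
dev1 = torch.device("cuda:0" if two_gpus else "cpu")  # Worker 1
dev2 = torch.device("cuda:1" if two_gpus else "cpu")  # Worker 2

d_model, n_heads = 64, 4
block1 = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True).to(dev1)  # B_1
block2 = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True).to(dev2)  # B_2

x = torch.randn(8, 16, d_model, device=dev1)   # (batch, seq_len, d_model)
memory = torch.randn(8, 16, d_model)           # stand-in encoder memory, for illustration

# Forward pass (↑): Worker 1 computes first, then hands activations to Worker 2.
h1 = block1(x, memory.to(dev1))                # B_1 (↑) on Worker 1
h2 = block2(h1.to(dev2), memory.to(dev2))      # B_2 (↑) on Worker 2

# Backward pass (↓): autograd retraces the chain in reverse, so gradients
# pass through B_2 on Worker 2 before reaching B_1 on Worker 1.
loss = h2.pow(2).mean()                        # toy loss, assumption for the sketch
loss.backward()
print(next(block1.parameters()).grad.shape)    # gradients arrived back at Worker 1
```

Note that under this naive schedule only one worker computes at a time: each waits for its neighbor's activations during the forward pass and for its neighbor's gradients during the backward pass, which is exactly the sequential flow shown in the diagram above.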
