Learn Before
Example of Model Parallelism with a Transformer Decoder
Layer-wise model parallelism can be applied to a Transformer decoder composed of stacked blocks. To distribute the computational load, each block is assigned to a different worker. During a single run of the model, the forward pass (Fprop) processes the input sequentially from the lowest layer (Worker 1) up to the highest layer (Worker n). Subsequently, the backward pass (Bprop) propagates error gradients in the reverse order, moving from Worker n back down to Worker 1. This creates a sequential execution flow across the workers, which can be visualized as follows:
| Worker   | t = 1 | t = 2 | ... | t = n | t = n+1 | ... | t = 2n-1 | t = 2n |
| -------- | ----- | ----- | --- | ----- | ------- | --- | -------- | ------ |
| Worker n |       |       |     | Fprop | Bprop   |     |          |        |
| ...      |       |       | ... |       |         | ... |          |        |
| Worker 2 |       | Fprop |     |       |         |     | Bprop    |        |
| Worker 1 | Fprop |       |     |       |         |     |          | Bprop  |
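To make the schedule concrete, here is a minimal Python sketch (not from the source; the function name `layerwise_schedule` is an illustrative choice) that builds the grid above for a given number of workers. Each worker computes Fprop at exactly one time step and Bprop at exactly one time step, and is idle (shown as "-") everywhere else, which is why per-worker utilization falls as 1/n.

```python
# Minimal sketch of the sequential schedule in layer-wise model
# parallelism for one training iteration with n workers.
# Worker i (1-indexed) runs Fprop at step i and Bprop at step 2n - i + 1.

def layerwise_schedule(num_workers: int) -> list[list[str]]:
    """Return a (worker x time-step) grid of 'Fprop', 'Bprop', or '-' (idle)."""
    total_steps = 2 * num_workers
    grid = [["-"] * total_steps for _ in range(num_workers)]
    for worker in range(num_workers):                     # worker 0 holds the lowest block
        grid[worker][worker] = "Fprop"                    # forward: steps 1..n, bottom-up
        grid[worker][total_steps - 1 - worker] = "Bprop"  # backward: steps n+1..2n, top-down
    return grid

if __name__ == "__main__":
    n = 4
    grid = layerwise_schedule(n)
    for worker in reversed(range(n)):                     # print highest worker first
        row = " ".join(f"{cell:>5}" for cell in grid[worker])
        print(f"Worker {worker + 1}: {row}")
    # Each worker is busy for only 2 of the 2n time steps:
    print(f"Per-worker utilization: {2 / (2 * n):.0%}")
```

For n = 4 this prints a staircase pattern matching the table, with each worker busy for only 2 of the 8 time steps (25% utilization), which previews the idle-time problem discussed in the related cards below.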
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Process Flow in Layer-wise Model Parallelism
Worker Idle Time in Layer-wise Model Parallelism
An engineer is tasked with training a very large neural network composed of 24 sequential layers. The model is too large to fit into the memory of a single processing device. To solve this, the engineer decides to distribute the model across 4 identical devices by partitioning it based on its layers. Which of the following strategies correctly applies this layer-based distribution method?
Analyzing Efficiency in a Distributed Model
Consider a large neural network with 12 sequential layers that needs to be distributed across 3 processing devices because it is too large for a single device. An engineer proposes the following distribution: Device 1 runs layers 1, 4, 7, 10; Device 2 runs layers 2, 5, 8, 11; and Device 3 runs layers 3, 6, 9, 12. This proposed method represents a correct implementation of a layer-based partitioning strategy where groups of consecutive layers are assigned to different devices.
Learn After
Symbolic Representation of Layer-wise Parallelism
A large neural network decoder, consisting of 12 sequential processing blocks, is distributed across 12 separate workers, with each worker assigned exactly one block. For a single input, the computation proceeds sequentially through the workers from 1 to 12 during the forward pass, and then in reverse from 12 to 1 during the backward pass. What is the primary factor limiting the overall computational efficiency of this specific arrangement?
A 3-block neural network decoder is distributed across 3 workers using layer-wise parallelism, with each worker responsible for one block (Worker 1 has Block 1, Worker 2 has Block 2, and Worker 3 has Block 3). For a single training iteration, arrange the following computational events in the correct chronological order.
GPU Utilization in a Distributed System