Activity (Process)

Sequence Parallelism

Sequence parallelism is a method for handling long sequences by distributing the attention computation across multiple processing units. When computing attention for a query $\mathbf{q}_i$ at position $i$ with respect to the keys $\mathbf{K}$ and values $\mathbf{V}$, the matrices $\mathbf{K}$ and $\mathbf{V}$ are partitioned row-wise into sets of sub-matrices, denoted $\left\{\mathbf{K}^{[1]}, \dots, \mathbf{K}^{[n_u]}\right\}$ and $\left\{\mathbf{V}^{[1]}, \dots, \mathbf{V}^{[n_u]}\right\}$, where each sub-matrix corresponds to a segment of the sequence. Each pair of sub-matrices $\mathbf{K}^{[u]}$ and $\mathbf{V}^{[u]}$ is then allocated to a separate computing node (e.g., a GPU in a cluster). The nodes execute in parallel, so the attention operation for the entire sequence is parallelized.
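The partitioning described above can be illustrated with a small single-process sketch. It is not a distributed implementation: the "nodes" are simulated by a Python loop, and the function names (`partial_attention`, `sequence_parallel_attention`) are hypothetical. The text does not say how the per-node results are combined, so the sketch assumes the common approach of merging partial softmax statistics (local maximum, local denominator, unnormalized output), and it assumes the usual scaling of scores by the square root of the dimension.

```python
import numpy as np

def full_attention(q, K, V):
    """Reference: softmax(K q / sqrt(d)) applied to V, for a single query q."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ V

def partial_attention(q, K_u, V_u):
    """Work done by one simulated node on its own segment (K_u, V_u).

    Returns the local max score, the local softmax denominator, and the
    unnormalized weighted sum of values, both taken relative to the local max.
    """
    scores = K_u @ q / np.sqrt(q.shape[0])
    m = scores.max()
    w = np.exp(scores - m)
    return m, w.sum(), w @ V_u

def sequence_parallel_attention(q, K, V, n_u):
    """Split K and V row-wise into n_u segments and merge the partial results."""
    K_parts = np.array_split(K, n_u, axis=0)
    V_parts = np.array_split(V, n_u, axis=0)
    # In a real system each call below would run on a different device.
    partials = [partial_attention(q, K_u, V_u)
                for K_u, V_u in zip(K_parts, V_parts)]
    # Rescale every partial result to a shared maximum before summing.
    m_global = max(m for m, _, _ in partials)
    denom = sum(s * np.exp(m - m_global) for m, s, _ in partials)
    numer = sum(o * np.exp(m - m_global) for m, _, o in partials)
    return numer / denom

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))   # 16 positions, dimension 8
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)

out_parallel = sequence_parallel_attention(q, K, V, n_u=4)
out_full = full_attention(q, K, V)
print(np.allclose(out_parallel, out_full))
```

Only three small quantities per segment need to be exchanged to recover the exact result, which is why each node can work on its own $\mathbf{K}^{[u]}$ and $\mathbf{V}^{[u]}$ independently before the final merge.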

Updated 2026-04-22

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
