Concept

Computing Attention Weights in Sequence Parallelism

When calculating attention weights across multiple computing nodes in a sequence-parallel setup, determining the numerator of the Softmax formula, $\exp(\beta_{i,j})$, is straightforward because the required keys and values are stored locally. However, computing the normalization denominator requires a summation over all positions $j'$, denoted as $\sum_{j'} \exp(\beta_{i,j'})$. Because the sequence data is partitioned, calculating this total sum necessitates transferring data to and from other computing nodes. If the keys $\mathbf{k}_j$ and values $\mathbf{v}_j$ are placed on a specific node $u$, the attention weight $\alpha_{i,j}$ can be calculated by explicitly dividing the denominator into sums over the key subsets $\mathbf{K}^{[1]}, \dots, \mathbf{K}^{[n_u]}$ located on nodes $1$ through $n_u$:

$$
\alpha_{i,j} = \frac{\overbrace{\exp(\beta_{i,j})}^{\text{node } u}}{\underbrace{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[1]}} \exp(\beta_{i,j'})}_{\text{node } 1} + \cdots + \underbrace{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[u]}} \exp(\beta_{i,j'})}_{\text{node } u} + \cdots + \underbrace{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[n_u]}} \exp(\beta_{i,j'})}_{\text{node } n_u}}
$$

In this equation, the notation $\mathbf{k}_{j'} \in \mathbf{K}^{[u]}$ indicates that $\mathbf{k}_{j'}$ is a row vector belonging to the sub-matrix $\mathbf{K}^{[u]}$. This rewritten formula illustrates that while the numerator is computed entirely on node $u$, the denominator must aggregate partial sums from all $n_u$ nodes.

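As a concrete illustration, here is a minimal single-process sketch of this aggregation: the keys are split into chunks that stand in for the $n_u$ nodes, each chunk computes its local numerators $\exp(\beta_{i,j})$ and its partial denominator, and a plain Python sum stands in for the cross-node communication that combines the partial sums. The function name, array shapes, and the use of NumPy are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sequence_parallel_attention_weights(q_i, key_chunks):
    """q_i: query vector of shape (d,); key_chunks: list of (m_u, d) key sub-matrices K^[u]."""
    # Each "node" computes exp(beta_{i,j}) = exp(q_i . k_j) for its local keys only.
    local_numerators = [np.exp(K_u @ q_i) for K_u in key_chunks]
    # Each "node" also produces its partial sum of the denominator.
    partial_denoms = [numer.sum() for numer in local_numerators]
    # Aggregating the partial sums is the step that requires communication
    # across nodes; a plain sum stands in for it here.
    denom = sum(partial_denoms)
    # Every node can now normalize its local attention weights alpha_{i,j}.
    return [numer / denom for numer in local_numerators]

# Toy check: the chunked result matches an ordinary softmax over all keys.
rng = np.random.default_rng(0)
d, lengths = 8, [3, 4, 2]                      # three "nodes" holding 3, 4, and 2 keys
q = rng.normal(size=d)
K_full = rng.normal(size=(sum(lengths), d))
chunks = np.split(K_full, np.cumsum(lengths)[:-1])
alpha_chunks = sequence_parallel_attention_weights(q, chunks)
reference = np.exp(K_full @ q) / np.exp(K_full @ q).sum()
assert np.allclose(np.concatenate(alpha_chunks), reference)
```

The final assertion checks that distributing the denominator this way reproduces the same attention weights as computing the Softmax over the full, unpartitioned key matrix.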