Learn Before
Computing Attention Weights in Sequence Parallelism
When calculating attention weights across multiple computing nodes in a sequence parallel setup, determining the numerator of the Softmax formula, $\exp(\beta_{i,j})$, is straightforward because the required keys and values are stored locally. However, computing the normalization denominator requires a summation over all positions $j'$, denoted as $\sum_{j'=1}^{m} \exp(\beta_{i,j'})$. Because the sequence data is partitioned, calculating this total sum necessitates transferring data to and from other computing nodes. If the keys $\mathbf{K}^{u}$ and values $\mathbf{V}^{u}$ are placed on a specific node $u$, the attention weight $\alpha_{i,j}$ can be calculated by explicitly dividing the denominator into sums over the key subsets located on nodes $1$ through $n$:

$$\alpha_{i,j} \;=\; \frac{\exp(\beta_{i,j})}{\displaystyle\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{1}} \exp(\beta_{i,j'}) \;+\; \cdots \;+\; \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{n}} \exp(\beta_{i,j'})}$$
In this equation, the notation $\mathbf{k}_{j'} \in \mathbf{K}^{u}$ indicates that $\mathbf{k}_{j'}$ is a row vector belonging to the sub-matrix $\mathbf{K}^{u}$. This rewritten formula illustrates that while the numerator is computed entirely on node $u$, the denominator must aggregate partial sums from all nodes.
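The following is a minimal NumPy sketch of this idea, not an implementation from the course material: it simulates the $n$ nodes as a list of key shards, has each "node" compute its local numerators and a scalar partial denominator, and then aggregates only those partial sums. The function name and variable names are illustrative assumptions.

```python
import numpy as np

def attention_weights_sequence_parallel(q, key_shards):
    """Compute alpha_{i,j} for one query q against keys split across shards.

    q          : query vector, shape (d,)
    key_shards : list of n arrays; shard u (the keys K^u on node u) has shape (m_u, d)
    Returns the full attention-weight vector alpha over all key positions.
    """
    d = q.shape[0]
    # Each "node" u computes its local numerators exp(beta_{i,j}) for k_j in K^u.
    local_exps = [np.exp(K_u @ q / np.sqrt(d)) for K_u in key_shards]
    # Only the scalar partial sums of the denominator need to be exchanged.
    partial_sums = [e.sum() for e in local_exps]
    denom = sum(partial_sums)  # aggregation of partial sums from all nodes
    return np.concatenate(local_exps) / denom

# Sanity check against a single-node Softmax over the unpartitioned keys.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(12, 8))
shards = np.array_split(K, 4)  # simulate 4 nodes holding K^1, ..., K^4
alpha_parallel = attention_weights_sequence_parallel(q, shards)
scores = K @ q / np.sqrt(8)
alpha_single = np.exp(scores) / np.exp(scores).sum()
assert np.allclose(alpha_parallel, alpha_single)
```

The sanity check reflects the point of the formula: splitting the denominator into per-node sums changes where the work happens and what must be communicated, but not the resulting attention weights.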
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Resolving Memory Bottlenecks in Attention Mechanisms
A machine learning team is processing an extremely long input sequence and wants to parallelize the self-attention computation across 4 GPUs using sequence parallelism. For a single attention head, which of the following strategies correctly describes how the Key (K) and Value (V) matrices should be partitioned and distributed?
Flawed Parallel Attention Implementation
Computing Attention Weights in Sequence Parallelism
Motivation for Sequence Parallelism
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
In a self-attention mechanism, a set of raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores?
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Distributed Attention Weight Formula