Learn Before
Distributed Attention Weight Formula
When computing attention over long sequences distributed across multiple nodes, the standard formula for an attention weight, $\alpha_{i,j}$, is adapted to reflect parallel processing. The numerator, which is the exponentiated pre-softmax score $\exp(\beta_{i,j})$, is calculated locally on the specific node containing the relevant data. The normalization factor in the denominator is expanded into a sum of independent summations computed across all nodes from $1$ to $N$:

$$\alpha_{i,j} \;=\; \frac{\exp(\beta_{i,j})}{\sum_{n=1}^{N} \sum_{k} \exp\!\big(\mathbf{q}_i \, [\mathbf{K}_n]_k^{\top}\big)}$$

Here, $\mathbf{q}_i$ is the query row vector, and the notation $[\mathbf{K}_n]_k$ signifies that it is a row vector (the $k$-th row) of the specific key sub-matrix, $\mathbf{K}_n$, located on node $n$.
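A minimal NumPy sketch of this idea (illustrative only, not taken from the course material; the function names and the two-node split are assumptions): each node exponentiates its local scores $\mathbf{q}_i [\mathbf{K}_n]_k^{\top}$ and reports a partial sum, the partial sums are added to form the global denominator, and every local numerator is divided by that shared value.

```python
import numpy as np

def local_scores(q, K_n):
    """Pre-softmax scores of query q against the key sub-matrix K_n held on one node."""
    return q @ K_n.T  # beta values for the keys stored locally

def distributed_attention_weights(q, key_shards):
    """Attention weights when the keys are split across N nodes.

    Each node exponentiates its local scores and contributes a partial sum;
    the global softmax denominator is the sum of the N independent partial sums.
    """
    exp_scores = [np.exp(local_scores(q, K_n)) for K_n in key_shards]  # numerators, per node
    denominator = sum(s.sum() for s in exp_scores)                     # sum of N independent sums
    return [s / denominator for s in exp_scores]                       # alpha values, per node

# Toy example: 8 keys of dimension 4, split across N = 2 nodes.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(8, 4))
key_shards = np.split(K, 2)                            # K_1 on node 1, K_2 on node 2
alphas = distributed_attention_weights(q, key_shards)
assert np.isclose(np.concatenate(alphas).sum(), 1.0)   # weights still normalize to 1
```

In practice one would also subtract a shared maximum score before exponentiating (a numerically stable softmax), but that refinement is omitted here to keep the per-node structure of the denominator visible.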
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
In a self-attention mechanism, a set of raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores?
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Computing Attention Weights in Sequence Parallelism
Distributed Attention Weight Formula
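The two related questions above can be checked with a short numerical sketch (illustrative only, not part of any card): the softmax of [2.0, 1.0, 0.5] assigns roughly 0.63 to the first score, and adding a constant to every score leaves the resulting weights unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax(scores):
    """Softmax in its numerically stable form: subtracting the max does not change the result."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax([2.0, 1.0, 0.5]))         # first weight is about 0.63
original = softmax([1.5, 0.5, -1.0])
shifted  = softmax([11.5, 10.5, 9.0])   # same scores with 10 added to each
print(np.allclose(original, shifted))   # True: softmax is invariant to a constant shift
```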