Formula

Distributed Attention Weight Formula

When computing attention over long sequences distributed across multiple nodes, the standard formula for an attention weight $\alpha_{i,j}$ is adapted to reflect parallel processing. The numerator, the exponentiated pre-softmax score $\exp(\beta_{i,j})$, is calculated locally on the node $u$ that holds the relevant data. The normalization factor in the denominator is expanded into a sum of independent summations, one per node from $1$ to $n_u$:

$$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[1]}} \exp(\beta_{i,j'}) + \dots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[n_u]}} \exp(\beta_{i,j'})}$$

Here, the notation $\mathbf{k}_{j'} \in \mathbf{K}^{[u]}$ signifies that $\mathbf{k}_{j'}$ is a row vector of the specific key sub-matrix $\mathbf{K}^{[u]}$ located on node $u$.
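To make the decomposition concrete, here is a minimal NumPy sketch. The function names, shapes, and the single-process simulation of the $n_u$ "nodes" are illustrative assumptions, not from the source: each node computes its own partial sum of exponentiated scores over its key shard, and the global denominator is simply the sum of those partial sums.

```python
import numpy as np

def local_scores(q_i, K_u):
    # Pre-softmax scores beta_{i,j'} of query q_i against the keys held on one node.
    return K_u @ q_i  # one score per row vector k_{j'} of K^[u]

def distributed_attention_weights(q_i, key_shards):
    # Attention weights alpha_{i,j}, with the softmax denominator split across nodes.
    # Each "node" computes its local summation  sum_{k_{j'} in K^[u]} exp(beta_{i,j'}).
    local_exp = [np.exp(local_scores(q_i, K_u)) for K_u in key_shards]
    denom = sum(e.sum() for e in local_exp)  # denominator: sum of the n_u partial sums
    # The numerator exp(beta_{i,j}) is available on the node that stores key j;
    # here we return all weights, grouped per node.
    return [e / denom for e in local_exp]

# Toy example: 3 simulated nodes, each holding 4 keys of dimension 8.
rng = np.random.default_rng(0)
key_shards = [rng.normal(size=(4, 8)) for _ in range(3)]
q_i = rng.normal(size=8)
weights = distributed_attention_weights(q_i, key_shards)
assert np.isclose(sum(w.sum() for w in weights), 1.0)  # weights still sum to 1
```

Note that a practical implementation would also scale the scores (e.g. by $1/\sqrt{d}$) and subtract a running maximum before exponentiating for numerical stability; the sketch omits these details to mirror the formula directly.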
