Sparse Attention Output Formula

For language models employing sparse attention, the output for a query token $\mathbf{q}_i$ at position $i$ is calculated as a weighted sum of value vectors. This computation is restricted to a specific subset of indices, $G \subseteq \{0, \dots, i\}$, where the attention weights are considered non-zero. The formula for the sparse attention output is:

$$\mathrm{Att}_{\mathrm{sparse}}(\mathbf{q}_i, \mathbf{K}_{\le i}, \mathbf{V}_{\le i}) = \sum_{j \in G} \alpha'_{i,j} \mathbf{v}_j$$

Here, $\mathbf{K}_{\le i} = \begin{bmatrix} \mathbf{k}_0 \\ \vdots \\ \mathbf{k}_i \end{bmatrix}$ and $\mathbf{V}_{\le i} = \begin{bmatrix} \mathbf{v}_0 \\ \vdots \\ \mathbf{v}_i \end{bmatrix}$ represent the keys and values up to position $i$. The summation iterates only over the indices $j$ in the sparse set $G$, applying the non-zero sparse attention weights $\alpha'_{i,j}$ to their corresponding value vectors $\mathbf{v}_j$.
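
The formula leaves the weights $\alpha'_{i,j}$ abstract; a common choice, assumed in the sketch below, is a softmax over scaled dot-product scores computed only for the positions in $G$ (positions outside $G$ are excluded entirely, as if their scores were $-\infty$). The function name `sparse_attention_output` and the example index set `G` are illustrative, not from the source.

```python
import numpy as np

def sparse_attention_output(q_i, K, V, G):
    """Sparse attention output for one query token (a minimal sketch).

    q_i : (d,)        query vector at position i
    K   : (i+1, d)    keys k_0 .. k_i
    V   : (i+1, d_v)  values v_0 .. v_i
    G   : iterable of indices j <= i whose weights are non-zero
    """
    G = np.asarray(sorted(G))
    d = q_i.shape[-1]
    # Scores only for positions in G; all other positions are skipped,
    # which is equivalent to giving them a score of -inf in the softmax.
    scores = (K[G] @ q_i) / np.sqrt(d)          # shape (|G|,)
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights = weights / weights.sum()           # alpha'_{i,j}, summing to 1 over G
    # Weighted sum over the selected value vectors: sum_{j in G} alpha'_{i,j} v_j
    return weights @ V[G]

# Example: query at position i = 7 attending to a sparse subset of positions.
rng = np.random.default_rng(0)
i, d = 7, 16
K = rng.standard_normal((i + 1, d))
V = rng.standard_normal((i + 1, d))
q_i = rng.standard_normal(d)
G = {0, 3, 7}   # e.g. an initial token, one selected token, and the current token
out = sparse_attention_output(q_i, K, V, G)     # shape (16,)
```

Because the softmax normalizes only over $G$, the cost of this step scales with $|G|$ rather than with the full prefix length $i + 1$, which is the point of sparse attention.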
