Sparse Attention Output Formula

For language models employing sparse attention, the output for a query token $\mathbf{q}_i$ at position $i$ is calculated as a weighted sum of value vectors. This computation is restricted to a specific subset of indices, $G \subseteq \{0, \dots, i\}$, where the attention weights are considered non-zero. The formula for the sparse attention output is:

$$\mathrm{Att}_{\mathrm{sparse}}(\mathbf{q}_i, \mathbf{K}_{\le i}, \mathbf{V}_{\le i}) = \sum_{j \in G} \alpha'_{i,j} \mathbf{v}_j$$

Here, $\mathbf{K}_{\le i} = \begin{bmatrix} \mathbf{k}_0 \\ \vdots \\ \mathbf{k}_i \end{bmatrix}$ and $\mathbf{V}_{\le i} = \begin{bmatrix} \mathbf{v}_0 \\ \vdots \\ \mathbf{v}_i \end{bmatrix}$ represent the keys and values up to position $i$. The summation iterates only over the indices $j$ in the sparse set $G$, applying the non-zero sparse attention weights $\alpha'_{i,j}$ to their corresponding value vectors $\mathbf{v}_j$.
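
The formula leaves the weights $\alpha'_{i,j}$ abstract; a common choice, assumed in the sketch below, is a softmax over scaled dot-product scores computed only for the positions in $G$ (positions outside $G$ are excluded entirely, as if their scores were $-\infty$). The function name `sparse_attention_output` and the example index set `G` are illustrative, not from the source.

```python
import numpy as np

def sparse_attention_output(q_i, K, V, G):
    """Sparse attention output for one query token (a minimal sketch).

    q_i : (d,)        query vector at position i
    K   : (i+1, d)    keys k_0 .. k_i
    V   : (i+1, d_v)  values v_0 .. v_i
    G   : iterable of indices j <= i whose weights are non-zero
    """
    G = np.asarray(sorted(G))
    d = q_i.shape[-1]
    # Scores only for positions in G; all other positions are skipped,
    # which is equivalent to giving them a score of -inf in the softmax.
    scores = (K[G] @ q_i) / np.sqrt(d)          # shape (|G|,)
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights = weights / weights.sum()           # alpha'_{i,j}, summing to 1 over G
    # Weighted sum over the selected value vectors: sum_{j in G} alpha'_{i,j} v_j
    return weights @ V[G]

# Example: query at position i = 7 attending to a sparse subset of positions.
rng = np.random.default_rng(0)
i, d = 7, 16
K = rng.standard_normal((i + 1, d))
V = rng.standard_normal((i + 1, d))
q_i = rng.standard_normal(d)
G = {0, 3, 7}   # e.g. an initial token, one selected token, and the current token
out = sparse_attention_output(q_i, K, V, G)     # shape (16,)
```

Because the softmax normalizes only over $G$, the cost of this step scales with $|G|$ rather than with the full prefix length $i + 1$, which is the point of sparse attention.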
