Efficiency of Aggregated State in Attention
An attention mechanism calculates the output for the i-th token using the formula Output_i = (q'_i * μ_i) / (q'_i * ν_i), where q'_i is the token's processed query. In this formula, μ_i and ν_i are state variables that aggregate information from all tokens from position 1 to i: μ_i aggregates past key-value products, and ν_i aggregates past keys. Explain how computing the output using these two aggregated state variables, rather than by directly comparing the query with every individual prior key, contributes to the mechanism's memory efficiency for long sequences.
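The memory argument can be illustrated with a short sketch. Note the feature map phi (elu + 1) and the exact state shapes are assumptions for illustration; the card only specifies that μ_i aggregates key-value products and ν_i aggregates keys:

```python
import numpy as np

def phi(x):
    # Feature map keeping entries positive (elu(x) + 1, a common choice
    # in linear attention; an assumption, not specified by the card).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_outputs(Q, K, V):
    """Compute each token's output from two aggregated state variables.

    Between steps the mechanism stores only mu (d x d) and nu (d,),
    whose sizes are independent of sequence length n. Softmax attention
    would instead have to cache all n prior keys and values (O(n * d)
    memory) to compare the query against every individual prior key.
    """
    n, d = Q.shape
    mu = np.zeros((d, d))  # aggregates phi(k_j) v_j^T for j <= i
    nu = np.zeros(d)       # aggregates phi(k_j) for j <= i
    out = np.zeros((n, d))
    for i in range(n):
        mu += np.outer(phi(K[i]), V[i])
        nu += phi(K[i])
        q = phi(Q[i])
        out[i] = (q @ mu) / (q @ nu)  # Output_i = (q'_i mu_i) / (q'_i nu_i)
    return out
```

Because q'_i distributes over the sums inside μ_i and ν_i, this produces exactly the same normalized weighted average of values as comparing the query with every prior key individually, while the per-step memory stays constant.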
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a memory-efficient attention mechanism, the output for a token at position i is calculated using the formula Output = (q'_i * μ_i) / (q'_i * ν_i). In this formula, q'_i is the token's processed query, while μ_i and ν_i are aggregations of historical information from all tokens up to and including position i. Specifically, μ_i aggregates past key-value products, and ν_i aggregates past keys. What is the primary function of the denominator, q'_i * ν_i?
Efficiency of Aggregated State in Attention
Evaluating a Modification to the Linear Attention Formula
In the formula for calculating a linear attention output, Output = (q'_i * μ_i) / (q'_i * ν_i), where q'_i is the transformed query, μ_i is the accumulated key-value state, and ν_i is the accumulated key state, what is the primary role of the denominator term q'_i * ν_i?
Calculating a Linear Attention Output Vector
Recurrent Computation of μ_i and ν_i in Linear Attention
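The recurrent computation named in the last related item can be sketched as a single state-update step. This is a minimal sketch; the outer-product form of the key-value accumulation is an assumption consistent with μ_i aggregating key-value products and ν_i aggregating keys:

```python
import numpy as np

def update_state(mu, nu, k_proc, v):
    """One recurrent step of the aggregated state:
        mu_i = mu_{i-1} + k'_i v_i^T   (accumulated key-value products)
        nu_i = nu_{i-1} + k'_i         (accumulated keys)
    The state occupies d*d + d numbers no matter how many tokens have
    been absorbed, which is the source of the memory saving.
    """
    return mu + np.outer(k_proc, v), nu + k_proc
```

Starting from a zero state and folding in tokens one at a time reproduces the full-history sums over positions 1 to i without ever storing the individual keys or values.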