Learn Before
Dense Attention Assumption
In the original version of self-attention, the attention weights are assumed to be dense. This means that for a given query at position i, most of the values in the attention weight vector are non-zero. Consequently, the query must compute its output by attending to nearly all key-value pairs up to position i.
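A minimal NumPy sketch of this idea (assuming standard softmax attention with toy, randomly initialized matrices): because softmax produces strictly positive weights, every query attends with non-zero weight to every allowed (non-future) position.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4  # illustrative sequence length and head dimension

Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Mask future positions: query i may only attend to keys 0..i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    # Numerically stable softmax over the allowed positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ V

weights, output = causal_attention(Q, K, V)

# Dense attention: every allowed weight is strictly positive,
# so query i genuinely depends on all key-value pairs 0..i.
assert np.all(weights[np.tril_indices(n)] > 0)
```

Because the weights are dense, the cost of computing the output at position i grows linearly with i, which is what sparse-attention variants later try to reduce.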
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In an autoregressive model, the attention output for a token is a weighted sum of the value vectors of itself and all preceding tokens. Consider a sequence of three tokens (at positions 0, 1, and 2). The value vectors are given as v_0 = [1, 2], v_1 = [3, 0], and v_2 = [4, 5]. The attention weights for the token at position 2, which determine the contribution of each token in the context, are α_2,0 = 0.1, α_2,1 = 0.6, and α_2,2 = 0.3. Based on this information, what is the attention output vector for the token at position 2?
Interpreting Causal Attention Output
Debugging a Causal Attention Calculation
Dense Attention Assumption
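The weighted-sum question in the Related section above can be checked with a short sketch (using the value vectors and weights given in the question):

```python
# Attention output at position 2 is the weighted sum over the context:
# output_2 = sum_j alpha_{2,j} * v_j, for j in {0, 1, 2}.
v = [[1, 2], [3, 0], [4, 5]]      # v_0, v_1, v_2
alpha = [0.1, 0.6, 0.3]           # alpha_{2,0}, alpha_{2,1}, alpha_{2,2}

output = [
    sum(a * vec[k] for a, vec in zip(alpha, v))
    for k in range(len(v[0]))
]
print(output)  # [3.1, 1.7] up to floating-point rounding
```

Working it out by hand: first component 0.1·1 + 0.6·3 + 0.3·4 = 3.1; second component 0.1·2 + 0.6·0 + 0.3·5 = 1.7.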