Concept

Dense Attention Assumption

In the original version of self-attention, the attention weights are assumed to be dense. This means that, for a given query at position $i$, most of the values in the attention weight vector $\begin{bmatrix} \alpha_{i,0} & \dots & \alpha_{i,i} \end{bmatrix}$ are non-zero. Consequently, the query must compute its output by attending to nearly all key-value pairs up to position $i$.
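This density follows directly from the softmax: it assigns a strictly positive weight to every unmasked position, so all $i+1$ entries of the weight vector are non-zero. Below is a minimal NumPy sketch illustrating this, assuming standard scaled dot-product attention with a causal mask; the function name `causal_attention_weights` and the shapes are illustrative choices, not code from the source.

```python
# Minimal sketch: causal scaled dot-product attention weights.
# Demonstrates that every weight alpha_{i,j} with j <= i is non-zero,
# i.e. the attention weight vector is dense.
import numpy as np

def causal_attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Return the (seq_len x seq_len) causal attention weight matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # raw similarity scores
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # mask out positions j > i
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)                        # exp(-inf) -> 0 for masked slots
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 8, 16
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
alpha = causal_attention_weights(Q, K)

i = seq_len - 1  # query at the last position
print(np.count_nonzero(alpha[i, : i + 1]), "of", i + 1, "weights are non-zero")
# Expected: all i+1 weights are non-zero, so the output at position i
# must aggregate (nearly) all key-value pairs up to position i.
```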
