Learn Before
Dense Attention Assumption
In the original version of self-attention, the attention weights are assumed to be dense. This means that for a given query at position i, most of the values in the attention weight vector are non-zero. Consequently, the query must compute its output by attending to nearly all key-value pairs up to position i.
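A minimal NumPy sketch of this idea (assuming standard softmax attention with toy, randomly initialized matrices): because softmax produces strictly positive weights, every query attends with non-zero weight to every allowed (non-future) position.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4  # illustrative sequence length and head dimension

Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Mask future positions: query i may only attend to keys 0..i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    # Numerically stable softmax over the allowed positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ V

weights, output = causal_attention(Q, K, V)

# Dense attention: every allowed weight is strictly positive,
# so query i genuinely depends on all key-value pairs 0..i.
assert np.all(weights[np.tril_indices(n)] > 0)
```

Because the weights are dense, the cost of computing the output at position i grows linearly with i, which is what sparse-attention variants later try to reduce.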
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In an autoregressive model, the attention output for a token is a weighted sum of the value vectors of itself and all preceding tokens. Consider a sequence of three tokens (at positions 0, 1, and 2). The value vectors are given as v_0 = [1, 2], v_1 = [3, 0], and v_2 = [4, 5]. The attention weights for the token at position 2, which determine the contribution of each token in the context, are α_2,0 = 0.1, α_2,1 = 0.6, and α_2,2 = 0.3. Based on this information, what is the attention output vector for the token at position 2?
Interpreting Causal Attention Output
Debugging a Causal Attention Calculation
Dense Attention Assumption
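The weighted-sum question in the Related section above can be checked with a short sketch (using the value vectors and weights given in the question):

```python
# Attention output at position 2 is the weighted sum over the context:
# output_2 = sum_j alpha_{2,j} * v_j, for j in {0, 1, 2}.
v = [[1, 2], [3, 0], [4, 5]]      # v_0, v_1, v_2
alpha = [0.1, 0.6, 0.3]           # alpha_{2,0}, alpha_{2,1}, alpha_{2,2}

output = [
    sum(a * vec[k] for a, vec in zip(alpha, v))
    for k in range(len(v[0]))
]
print(output)  # [3.1, 1.7] up to floating-point rounding
```

Working it out by hand: first component 0.1·1 + 0.6·3 + 0.3·4 = 3.1; second component 0.1·2 + 0.6·0 + 0.3·5 = 1.7.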