Learn Before
Example of Query-Key Interactions in Causal Attention
In a causal self-attention mechanism, the set of calculated query-key dot products explicitly demonstrates the autoregressive nature of the model: each position can attend only to itself and to preceding positions. The following list enumerates all such interactions for a sequence of length 7 (positions 0 through 6), where qᵢ denotes the query at position i and kⱼᵀ denotes the transposed key at position j:
- q₀: q₀k₀ᵀ
- q₁: q₁k₀ᵀ, q₁k₁ᵀ
- q₂: q₂k₀ᵀ, q₂k₁ᵀ, q₂k₂ᵀ
- q₃: q₃k₀ᵀ, q₃k₁ᵀ, q₃k₂ᵀ, q₃k₃ᵀ
- q₄: q₄k₀ᵀ, q₄k₁ᵀ, q₄k₂ᵀ, q₄k₃ᵀ, q₄k₄ᵀ
- q₅: q₅k₀ᵀ, q₅k₁ᵀ, q₅k₂ᵀ, q₅k₃ᵀ, q₅k₄ᵀ, q₅k₅ᵀ
- q₆: q₆k₀ᵀ, q₆k₁ᵀ, q₆k₂ᵀ, q₆k₃ᵀ, q₆k₄ᵀ, q₆k₅ᵀ, q₆k₆ᵀ
This pattern ensures that the prediction for a token at a given position is not influenced by any future tokens.
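The triangular pattern above can be reproduced directly with a causal mask. The sketch below (a minimal illustration; the dimensions, random weights, and variable names are assumed for demonstration, not taken from the course) computes the full 7×7 score matrix, masks out every qᵢkⱼᵀ with j > i, and verifies that row i ends up with exactly i + 1 nonzero attention weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 7, 8  # sequence length 7, illustrative head dimension
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))

# Full score matrix: scores[i, j] = q_i . k_j^T (scaled)
scores = Q @ K.T / np.sqrt(d)

# Causal mask: positions j > i are disallowed, so set them to -inf
# and softmax will assign them zero weight.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row i attends to exactly positions 0..i, matching the list above.
for i in range(seq_len):
    assert np.count_nonzero(weights[i]) == i + 1
```

Masking with −∞ before the softmax (rather than zeroing scores afterwards) is the standard trick: it guarantees the remaining weights in each row still sum to 1.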
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational Cost per Token in Causal Attention
Reusability of Key-Value Pairs in Autoregressive Inference
Example of Query-Key Interactions in Causal Attention
An autoregressive model is generating a sequence of tokens one by one. It is currently calculating the attention output for the token at position 4 (i.e., the fifth token in the sequence). To ensure the model only uses information it has already seen, which set of key (K) and value (V) vectors must be used as input to the attention mechanism for the query vector at position 4 (q₄)?
Diagnosing Information Leakage in an Autoregressive Model
When calculating the attention output for a specific token at position i in an autoregressive model, the mechanism is structured to use the query vector from that same position (qᵢ), while the key and value matrices are composed of the corresponding vectors from all positions in the full input sequence.
Learn After
A self-attention mechanism is configured to ensure that, when processing a sequence, the output for any given position i is influenced only by inputs from positions j where j <= i. This prevents the model from 'seeing' future elements. The interaction between the query from position i and the key from position j results in a score. For a sequence of 4 elements (positions 0, 1, 2, 3), which of the following score matrices violates this principle? ('S' indicates a calculated score; '0' indicates a disallowed or masked interaction.)
Debugging an Autoregressive Model's Attention
In a self-attention mechanism designed to process information sequentially without looking ahead, a sequence of 8 elements (indexed 0 to 7) is being processed. The query vector for the element at position 5 will be compared against a total of ____ key vectors.