Learn Before
Debugging a Causal Attention Calculation
A developer implementing an autoregressive model finds a bug. For a sequence of 4 tokens (indexed 0, 1, 2, 3), the attention output for the token at position 2 is being computed with the following weighted sum:
Output_at_pos_2 = (0.1 * v_0) + (0.2 * v_1) + (0.6 * v_2) + (0.1 * v_3)
where v_j is the value vector for the token at position j. Identify the specific term in this expression that violates the core principle of causal (masked) attention, and explain why its inclusion is incorrect for an autoregressive task.
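A minimal sketch of the underlying mechanism, assuming NumPy and hypothetical 2-d value vectors (the actual v_j are not specified in the question): a causal mask restricts position 2 to positions j <= 2, and the surviving weights are renormalized before the weighted sum.

import numpy as np

# Hypothetical value vectors for the 4-token sequence (placeholders; the
# question does not give concrete v_j).
V = np.array([[1.0, 0.0],   # v_0
              [0.0, 1.0],   # v_1
              [1.0, 1.0],   # v_2
              [2.0, 2.0]])  # v_3

# The buggy weights from the expression above.
weights = np.array([0.1, 0.2, 0.6, 0.1])

# Causal attention: token i may only attend to positions j <= i.
i = 2
mask = np.arange(len(weights)) <= i           # [True, True, True, False]
masked = np.where(mask, weights, 0.0)         # zero out future positions
masked = masked / masked.sum()                # renormalize the allowed weights
output_at_pos_2 = masked @ V                  # weighted sum over permitted v_j
print(output_at_pos_2)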
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In an autoregressive model, the attention output for a token is a weighted sum of the value vectors of itself and all preceding tokens. Consider a sequence of three tokens (at positions 0, 1, and 2). The value vectors are given as v_0 = [1, 2], v_1 = [3, 0], and v_2 = [4, 5]. The attention weights for the token at position 2, which determine the contribution of each token in the context, are α_2,0 = 0.1, α_2,1 = 0.6, and α_2,2 = 0.3. Based on this information, what is the attention output vector for the token at position 2? (A worked sketch of this sum follows the list below.)
Interpreting Causal Attention Output
Dense Attention Assumption
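For the related question above (three tokens, weights α_2,0 = 0.1, α_2,1 = 0.6, α_2,2 = 0.3), here is a minimal NumPy sketch of the weighted sum; the only inputs are the numbers stated in that question.

import numpy as np

# Value vectors and attention weights from the related question.
v = np.array([[1.0, 2.0],   # v_0
              [3.0, 0.0],   # v_1
              [4.0, 5.0]])  # v_2
alpha = np.array([0.1, 0.6, 0.3])  # α_2,0, α_2,1, α_2,2 (sum to 1)

# Attention output for position 2: sum over j of α_2,j * v_j.
output = alpha @ v
print(output)  # [3.1 1.7] = 0.1*[1, 2] + 0.6*[3, 0] + 0.3*[4, 5]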