Learn Before
Rationale for Causal Mask Values
In a self-attention mechanism designed for sequential data processing (like generating text), a mask matrix is added to the raw attention scores before a normalization step. This matrix uses values of 0 for positions a token is allowed to attend to, and negative infinity (-∞) for positions it is forbidden from attending to. Explain precisely why negative infinity is used for the forbidden positions and what effect this has on the final, normalized attention weights.
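To see the effect concretely, here is a minimal sketch (assuming NumPy; the 5-position score vector is illustrative, not from the course material). Adding −∞ before the softmax drives exp(−∞) to exactly 0, so forbidden positions receive zero attention weight while the allowed positions still normalize to sum to 1.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)      # shift for numerical stability
    e = np.exp(z)          # np.exp(-np.inf) evaluates to exactly 0.0
    return e / e.sum()

# Toy raw scores for one query token attending to 5 positions (illustrative).
scores = np.array([2.0, 1.0, 0.5, 3.0, -1.0])

# Causal mask for the 3rd token: 0 for positions 1-3 (allowed),
# -inf for positions 4-5 (future, forbidden).
mask = np.array([0.0, 0.0, 0.0, -np.inf, -np.inf])

weights = softmax(scores + mask)
print(weights)        # approx [0.63, 0.23, 0.14, 0.  , 0.  ]
print(weights.sum())  # 1.0 -- forbidden positions get exactly zero weight
```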
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a self-attention mechanism designed for autoregressive tasks, a sequence of 5 tokens is processed. The mechanism computes raw attention scores for each token relative to all other tokens. Before a final normalization step, a mask is added to these scores to prevent any token from attending to future tokens. For the 3rd token in the sequence, which vector correctly represents its scores for all 5 tokens after this causal mask has been applied? (Let s_i denote the original raw score for the 3rd token attending to the i-th token.)
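A worked sketch of the masked vector, assuming the 0 / −∞ convention from the question above: the 3rd token may attend to positions 1 through 3 and is blocked from positions 4 and 5, giving

$$\left[\, s_1,\; s_2,\; s_3,\; -\infty,\; -\infty \,\right]$$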
In a self-attention mechanism processing a sequence of 4 tokens, a mask is added to the raw attention scores to prevent any token from attending to subsequent (future) tokens. Which of the following 4x4 matrices correctly represents this mask?
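For reference, a minimal sketch (assuming NumPy; np.triu with k=1 keeps only the entries strictly above the diagonal) that constructs such a 4x4 causal mask, with 0 for allowed positions and −∞ for future positions:

```python
import numpy as np

n = 4
# Start from an all--inf matrix, then keep -inf only strictly above the
# diagonal (future positions); everything else becomes 0 (allowed).
mask = np.triu(np.full((n, n), -np.inf), k=1)
print(mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```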