Learn Before
Applying a Causal Mask to Attention Scores
A language model is processing a sequence of 4 tokens. After calculating the scaled dot-product scores between queries and keys, it produces the following 4x4 matrix of pre-softmax attention scores:
[[2.1, 1.5, 0.8, 1.2],
 [1.8, 2.5, 1.1, 0.9],
 [0.7, 1.3, 2.2, 1.9],
 [1.4, 0.6, 1.7, 2.8]]
To ensure that each token attends only to itself and the tokens before it, a causal mask is added to this matrix before the final softmax normalization step. Based on this process, what will be the value of the element at row index 1, column index 2 (the value 1.1) after the causal mask is applied, and why?
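The step described above can be reproduced in a few lines of NumPy. This is a minimal sketch (not part of the original question), assuming the common additive-mask convention in which negative infinity is added to every position above the diagonal:

```python
import numpy as np

# Pre-softmax attention scores from the question.
scores = np.array([
    [2.1, 1.5, 0.8, 1.2],
    [1.8, 2.5, 1.1, 0.9],
    [0.7, 1.3, 2.2, 1.9],
    [1.4, 0.6, 1.7, 2.8],
])

# Causal mask: 0 on and below the diagonal, -inf strictly above it,
# so each token can attend only to itself and earlier positions.
mask = np.triu(np.full_like(scores, -np.inf), k=1)

masked = scores + mask  # element [1, 2] becomes 1.1 + (-inf) = -inf

# Row-wise softmax; subtracting the row max keeps exp() numerically stable.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(masked[1, 2])   # -inf: the masked pre-softmax score
print(weights[1, 2])  # 0.0: the attention weight after softmax
```

The element at row index 1, column index 2 sits above the diagonal (token 1 attending to the future token 2), so the mask adds negative infinity to it: 1.1 + (-inf) = -inf. The subsequent softmax maps it to an attention weight of exactly 0, meaning token 1 places no weight on token 2.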
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
Visualization of Query-Key Dot Products in Causal Attention
An autoregressive model calculates a square attention weight matrix using the formula:
Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?

An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero?
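As a quick illustrative sketch (the random scores here merely stand in for QK^T / sqrt(d)), the masked softmax always produces a lower-triangular weight matrix whose rows sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for QK^T / sqrt(d)

# Additive causal mask: -inf strictly above the diagonal.
mask = np.triu(np.full((4, 4), -np.inf), k=1)

weights = np.exp(scores + mask)
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 3))  # entries above the diagonal are exactly 0
print(weights.sum(axis=-1))  # each row sums to 1
```

Every entry above the diagonal is forced to zero, so row i distributes its attention only over columns 0 through i; this is exactly the lower-triangular 'α'/'0' pattern the question describes.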