Learn Before
Causal Attention Output for a Single Token
In autoregressive language models, next tokens are predicted based solely on their preceding context (the 'left-context'). Accordingly, the output of the attention mechanism for a single token at position $i$ is calculated using only information from tokens $0$ to $i$. This output is formulated as the product of the attention weight row vector for token $i$ and the matrix of corresponding value vectors up to that position:

$$\mathbf{o}_i = \boldsymbol{\alpha}_i \, \mathbf{V}_{\le i}$$

This matrix multiplication is equivalent to a weighted sum of the value vectors:

$$\mathbf{o}_i = \sum_{j=0}^{i} \alpha_{i,j} \, \mathbf{v}_j$$

In these equations, the keys and values up to position $i$ are respectively defined as the matrices $\mathbf{K}_{\le i} = [\mathbf{k}_0; \dots; \mathbf{k}_i]$ and $\mathbf{V}_{\le i} = [\mathbf{v}_0; \dots; \mathbf{v}_i]$, and $\boldsymbol{\alpha}_i = [\alpha_{i,0}, \dots, \alpha_{i,i}]$ is the row of attention weights for token $i$.
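The single-token formulation above can be sketched in NumPy. This is a minimal illustration, not the course's reference implementation; the function name `causal_attention_output` and the softmax over query-key scores are assumptions consistent with the scaled dot-product attention described in the related questions below.

```python
import numpy as np

def causal_attention_output(q_i, K_prefix, V_prefix):
    """Attention output for the token at position i, using only the
    keys and values up to and including position i (the left-context)."""
    d = q_i.shape[-1]
    scores = q_i @ K_prefix.T / np.sqrt(d)   # one score per visible token
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()              # softmax -> attention weight row vector
    return alpha @ V_prefix                  # weighted sum of the value vectors

# Toy example: the token at position 2 attends to positions 0..2.
rng = np.random.default_rng(0)
q = rng.standard_normal(4)        # query for token at position 2
K = rng.standard_normal((3, 4))   # keys k_0..k_2
V = rng.standard_normal((3, 2))   # values v_0..v_2
o = causal_attention_output(q, K, V)
```

The final matrix product `alpha @ V_prefix` is exactly the weighted sum of value vectors stated in the second equation.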

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
Visualization of Query-Key Dot Products in Causal Attention
An autoregressive model calculates a square attention weight matrix using the formula:
Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?

An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero?
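The masked-softmax formula in the question above can be sketched directly; this is an illustrative NumPy version (function name assumed), showing that adding -inf above the diagonal before the softmax yields a lower-triangular weight matrix whose rows sum to 1.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Full attention weight matrix Softmax((QK^T / sqrt(d)) + Mask).
    Position i receives exactly zero weight on every position j > i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Mask: 0 on and below the diagonal, -inf strictly above it,
    # so the softmax assigns zero probability to future tokens.
    mask = np.where(np.triu(np.ones((n, n)), k=1) == 1, -np.inf, 0.0)
    scores = scores + mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
A = causal_attention_weights(rng.standard_normal((4, 8)),
                             rng.standard_normal((4, 8)))
# A is 4x4, lower-triangular: entries above the diagonal are exactly 0.
```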
Applying a Causal Mask to Attention Scores
Learn After
In an autoregressive model, the attention output for a token is a weighted sum of the value vectors of itself and all preceding tokens. Consider a sequence of three tokens (at positions 0, 1, and 2). The value vectors are given as v_0 = [1, 2], v_1 = [3, 0], and v_2 = [4, 5]. The attention weights for the token at position 2, which determine the contribution of each token in the context, are α_2,0 = 0.1, α_2,1 = 0.6, and α_2,2 = 0.3. Based on this information, what is the attention output vector for the token at position 2?
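The weighted sum in the question above can be checked with a few lines of NumPy (values and weights are taken verbatim from the question):

```python
import numpy as np

V = np.array([[1.0, 2.0],    # v_0
              [3.0, 0.0],    # v_1
              [4.0, 5.0]])   # v_2
alpha_2 = np.array([0.1, 0.6, 0.3])  # weights alpha_2,0 .. alpha_2,2

# Attention output for the token at position 2:
# same as 0.1*v_0 + 0.6*v_1 + 0.3*v_2
o_2 = alpha_2 @ V
# o_2 -> array([3.1, 1.7])
```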
Interpreting Causal Attention Output
Debugging a Causal Attention Calculation
Dense Attention Assumption