Learn Before
Visualization of Query-Key Dot Products in Causal Attention
In a causal attention mechanism, a query at a given position is only allowed to attend to keys at the same or preceding positions, preventing information flow from the future. This is implemented by computing dot products only between a query vector q_i and key vectors k_j where the key's index j is less than or equal to the query's index i. For a sequence of length 7 (indexed 0 to 6), the specific query-key dot products that are calculated are as follows:
- For token 0: q_0 · k_0
- For token 1: q_1 · k_0, q_1 · k_1
- For token 2: q_2 · k_0, q_2 · k_1, q_2 · k_2
- For token 3: q_3 · k_0, q_3 · k_1, q_3 · k_2, q_3 · k_3
- For token 4: q_4 · k_0, q_4 · k_1, q_4 · k_2, q_4 · k_3, q_4 · k_4
- For token 5: q_5 · k_0, q_5 · k_1, q_5 · k_2, q_5 · k_3, q_5 · k_4, q_5 · k_5
- For token 6: q_6 · k_0, q_6 · k_1, q_6 · k_2, q_6 · k_3, q_6 · k_4, q_6 · k_5, q_6 · k_6
This selective computation results in a lower triangular attention score matrix, which is fundamental to autoregressive models.
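To make the pattern concrete, here is a minimal NumPy sketch (not part of the original card) that computes exactly the allowed dot products for a length-7 sequence. The head dimension and the random query/key vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 7, 4                      # sequence length and head dimension (illustrative values)
Q = rng.standard_normal((seq_len, d))  # one query vector per token
K = rng.standard_normal((seq_len, d))  # one key vector per token

# Compute only the causally allowed dot products: key index j <= query index i.
scores = np.full((seq_len, seq_len), -np.inf)  # disallowed positions stay -inf
for i in range(seq_len):          # query position
    for j in range(i + 1):        # key positions 0..i
        scores[i, j] = Q[i] @ K[j]

# The finite entries form a lower triangular pattern, matching the list above.
print(np.isfinite(scores).astype(int))
```

Printing the finite-entry pattern yields a 7x7 lower triangular matrix of ones, mirroring the token-by-token list above.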
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
An autoregressive model calculates a square attention weight matrix using the formula Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?

An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero?
Applying a Causal Mask to Attention Scores
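The first related question above cites the formula Softmax((QK^T / sqrt(d)) + Mask). As a companion sketch (again with assumed shapes and random data, not the course's own code), the following shows how an additive mask of -inf above the diagonal produces the lower triangular weight matrix those questions describe:

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Softmax((Q K^T / sqrt(d)) + Mask) with -inf above the diagonal."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # -inf strictly above diagonal, 0 elsewhere
    logits = scores + mask
    # Numerically stable row-wise softmax; exp(-inf) = 0 zeroes out future positions.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 tokens, head dimension 8 (illustrative)
K = rng.standard_normal((4, 8))
W = causal_attention_weights(Q, K)
print(np.round(W, 2))  # lower triangular: zeros above the diagonal, each row sums to 1
```

Because the mask is added before the softmax, the masked positions become exact zeros in the weight matrix while each row still sums to 1, which is the structure the 4x4 matrix question asks about.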
Learn After
An autoregressive model is processing a sequence of 5 tokens, indexed 0 through 4. The model's attention mechanism is constrained so that any given token can only attend to itself and to tokens that appeared earlier in the sequence. Which of the following diagrams correctly visualizes the set of all required dot product calculations between query vectors (q, representing each token's perspective) and key vectors (k, representing each token's content)? An 'X' marks a calculation that is performed.
Total Attention Score Calculations
An autoregressive model is processing a sequence of 6 tokens, indexed 0 through 5. The model uses an attention mechanism where a query from a specific token position can only interact with keys from the same or preceding positions. Match each query vector to the complete set of key vectors it will be multiplied with to calculate attention scores.