Enumeration of Dot Products in Causal Self-Attention
In a causal self-attention mechanism, the query vector for a given token i, denoted q_i, computes dot products only with the key vectors of tokens up to and including its own position. This ensures that the prediction for token i does not depend on future tokens. The sequence of dot products computed is as follows:
- For token 0: q_0 · k_0
- For token 1: q_1 · k_0, q_1 · k_1
- For token 2: q_2 · k_0, q_2 · k_1, q_2 · k_2
This pattern continues for subsequent tokens: for any token i, the dot products q_i · k_j are computed for all j ≤ i.
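To make the pattern concrete, here is a minimal NumPy sketch that enumerates exactly these causal dot products. The sequence length, key dimension, and random toy vectors are illustrative assumptions, not values from the card:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_k = 4, 8                  # toy sizes: 4 tokens, 8-dim queries/keys
Q = rng.normal(size=(seq_len, d_k))  # query vectors q_0 .. q_3
K = rng.normal(size=(seq_len, d_k))  # key vectors   k_0 .. k_3

# Enumerate the causal dot products: token i attends only to positions j <= i.
for i in range(seq_len):
    scores = [float(Q[i] @ K[j]) for j in range(i + 1)]  # q_i · k_j for j <= i
    print(f"token {i}: " + ", ".join(
        f"q_{i}·k_{j} = {s:.2f}" for j, s in enumerate(scores)))
```

Token 0 prints one score, token 1 prints two, and so on, reproducing the enumeration above.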
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Causal Attention Input Structure
Enumeration of Dot Products in Causal Self-Attention
State Variables in Linear Attention (μ_i, ν_i)
In an autoregressive attention mechanism, a sequence of key vectors is generated. Given the first three key vectors k_0 = [1, 2], k_1 = [3, 4], and k_2 = [5, 6], which of the following matrices represents the complete set of keys that the query at position i = 2 is allowed to interact with?
Debugging a Causal Attention Implementation
In an autoregressive attention mechanism processing a sequence of 10 tokens (indexed 0 to 9), the matrix of key vectors used to compute the output for the token at position 3 is identical to the matrix of key vectors used for the token at position 7.
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap correctly illustrates the attention pattern required for this type of sequential generation model to function correctly?
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
Debugging a Generative Language Model
Example of Causal Attention Dot Products
Choosing the Right Attention Mechanism
Learn After
Explicit Enumeration of Causal Self-Attention Dot Products
An autoregressive model processes a sequence of tokens, where the query for a given token i (denoted q_i) can only interact with key vectors from positions j where j ≤ i. For the 4th token in a sequence (indexed as i = 3), which of the following dot product computations would not be performed?

In a self-attention mechanism where the prediction for a token at position i can only depend on tokens from positions 0 up to and including i, what is the total number of query-key dot products computed for an entire input sequence of 5 tokens (indexed 0 to 4)?

An autoregressive model processes a sequence of tokens, where the query for a given token i (denoted as q_i) can only interact with key vectors from positions j where j ≤ i. Match each query vector with the complete set of dot products it computes.
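The counting questions above all reduce to the same fact: the allowed query-key pairs form a lower-triangular mask, with n(n+1)/2 entries for a sequence of n tokens. A short sketch under that assumption (the variable names are illustrative):

```python
import numpy as np

seq_len = 5  # matches the 5-token question above

# Causal mask: entry (i, j) is True when query i may attend to key j, i.e. j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(mask.astype(int))  # lower-triangular pattern of allowed attention
print(int(mask.sum()))   # 15 allowed dot products = 5*6/2 = n(n+1)/2
```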