Learn Before
Example of Causal Attention Dot Products
In a causal self-attention mechanism, a query at a given position i is restricted to attending only to keys at positions j that are less than or equal to i (j ≤ i). This ensures that the prediction for a token depends only on the preceding tokens. The following list illustrates the specific query-key dot products that are computed for a sequence, demonstrating the lower-triangular pattern of attention scores before masking:
- For query q₀: q₀·k₀
- For query q₁: q₁·k₀, q₁·k₁
- For query q₂: q₂·k₀, q₂·k₁, q₂·k₂
- For query q₃: q₃·k₀, q₃·k₁, q₃·k₂, q₃·k₃
This pattern continues for the entire sequence length.
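To make the pattern concrete, here is a minimal NumPy sketch (an illustration, not part of the original card; the sequence length of 4 and key dimension of 8 are arbitrary choices) that enumerates exactly these causal dot products:

```python
import numpy as np

# Minimal sketch: enumerate the causal query-key dot products for a
# toy sequence. seq_len and d_k are illustrative choices, not fixed
# by the card above.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8

Q = rng.normal(size=(seq_len, d_k))  # one query vector per position
K = rng.normal(size=(seq_len, d_k))  # one key vector per position

# A query at position i may only attend to keys at positions j <= i,
# which yields the lower-triangular pattern listed above.
for i in range(seq_len):
    scores = [float(Q[i] @ K[j]) for j in range(i + 1)]
    print(f"q{i}: dot products with k0..k{i} ->", np.round(scores, 2))
```

Each row of output contains one more score than the previous row, which is the lower-triangular shape the list describes.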
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap illustrates the attention pattern required for this type of sequential generation model to function correctly?
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
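For reference, the lower-triangular structure described in the card above can be constructed directly. The following is a minimal sketch (NumPy, with arbitrary score values; not part of the original question) of how such a causal mask is built and applied before the softmax:

```python
import numpy as np

# Sketch: a causal mask for a 4-token sequence is lower-triangular;
# position i may attend only to positions j <= i.
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Before the softmax, masked positions are typically set to -inf so
# they receive zero attention weight.
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
masked = np.where(mask, scores, -np.inf)
```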
Debugging a Generative Language Model
Choosing the Right Attention Mechanism
Learn After
In a self-attention mechanism designed for generating text one token at a time, the calculation for a token at a specific position must only depend on the tokens that came before it and the token at the current position. For a sequence of 5 tokens (indexed 0 to 4), which of the following dot product calculations between a query vector (q) and a key vector (k) would be disallowed to maintain this property?
In a self-attention mechanism where the output for any given position can only depend on inputs at the current and preceding positions, consider a sequence of 8 tokens (indexed 0 to 7). The query vector for the final token in the sequence will have its dot product computed with a total of ___ key vectors.
In a language model that generates text sequentially, the attention mechanism ensures that the prediction for a token only depends on the tokens that have come before it, including itself. For a sequence of 6 tokens (indexed 0 to 5), which of the following lists represents the complete set of dot products that must be computed for the query vector at position 3 (q₃)?
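For self-checking questions like the ones above, a small helper (hypothetical, written for this note; the name causal_dot_products is an assumption) can enumerate the allowed dot products for any sequence length under the convention that position i attends to positions j ≤ i:

```python
# Sketch: enumerate the dot products a causal model computes, to
# sanity-check questions like those above. seq_len is arbitrary.
def causal_dot_products(seq_len: int) -> dict[int, list[str]]:
    """Map each query position i to the dot products q_i . k_j with j <= i."""
    return {i: [f"q{i}·k{j}" for j in range(i + 1)] for i in range(seq_len)}

# For a 6-token sequence, the query at position 3 pairs with keys 0..3:
print(causal_dot_products(6)[3])  # ['q3·k0', 'q3·k1', 'q3·k2', 'q3·k3']
```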