Causal Self-Attention in Autoregressive Decoders
In autoregressive decoders, a variant of self-attention called causal (or masked) self-attention is employed. It restricts the model's attention so that the query at position i can attend only to the keys at positions 0 through i; attention to any later position (j > i) is masked out. Since the output at position i is used to predict the token at position i+1, each prediction depends only on the tokens that precede it. This is fundamental to maintaining the autoregressive property of generating a sequence one token at a time: the model cannot 'see' into the future.
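As a concrete illustration, here is a minimal single-head sketch (plain NumPy; the function name `causal_attention` and the shapes are illustrative, not taken from the source). Future positions receive a score of negative infinity before the softmax, so they end up with exactly zero weight:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head).

    Q, K, V have shape (seq_len, d_k). The query at position i may
    attend only to keys at positions j <= i.
    """
    seq_len, d_k = Q.shape
    # Un-normalized attention scores, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask future positions (j > i) with -inf so softmax gives them weight 0.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, stabilized by subtracting each row's maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of value vectors over visible positions.
    return weights @ V

# Illustrative usage with random inputs.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = causal_attention(Q, K, V)
```

Masking with negative infinity before the softmax, rather than zeroing the weights afterwards, keeps each row of attention weights properly normalized over the visible positions.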

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Scaled Dot-Product Attention
A model is processing a sequence of three tokens. For the query at position 2, the un-normalized attention scores with respect to the keys at positions 0, 1, and 2 are calculated as [1.0, 2.0, 3.0] respectively. What is the final attention weight that the token at position 2 will assign to the token at position 1? (See the first worked sketch after this list.)
Attention Output as a Weighted Sum of Values
Impact of Masking on Attention Weight Distribution
True or False: In a self-attention mechanism, if you add the same constant value to all un-normalized attention scores corresponding to a single query vector, the final normalized attention weights for that query will change. (See the second sketch after this list.)
Attention Weight Formula
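As a check on the attention-weight question above: the un-normalized scores are passed through a softmax, so the weight assigned to position 1 is e^2 / (e^1 + e^2 + e^3) ≈ 0.245. A minimal sketch (plain NumPy, with an illustrative `softmax` helper):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting weights.
    e = np.exp(x - x.max())
    return e / e.sum()

# Un-normalized scores for the query at position 2 against keys 0, 1, 2.
scores = np.array([1.0, 2.0, 3.0])
weights = softmax(scores)
print(weights)     # ~[0.090, 0.245, 0.665]
print(weights[1])  # weight on position 1 ≈ 0.245
```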
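The true/false question above follows from the same property: adding a constant c to every score multiplies each exponential by e^c, and that common factor cancels in the normalization, so the weights do not change (the statement is false). A quick check:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
# The common factor e^5 cancels during normalization, leaving the
# weights identical.
print(np.allclose(softmax(scores), softmax(scores + 5.0)))  # True
```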
Learn After
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap correctly illustrates the attention pattern required for this type of sequential generation model to function correctly? (A sketch of the required pattern appears after this list.)
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
Debugging a Generative Language Model
Example of Causal Attention Dot Products
Choosing the Right Attention Mechanism
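For the heatmap question above: sequential generation requires the lower-triangular pattern, in which each query row i is dark only at key columns j ≤ i; given the image description, this corresponds to heatmap A. A minimal sketch of the allowed-attention pattern (plain NumPy, 4 tokens):

```python
import numpy as np

# 1 marks an allowed (query i, key j) pair; causal attention allows j <= i.
allowed = np.tril(np.ones((4, 4), dtype=int))
print(allowed)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Heatmap B would let every token attend to future tokens, C attends only to the present and future, and D would let each token attend only to itself.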