Causal Self-Attention in Autoregressive Decoders
In autoregressive decoders, a variant of self-attention called causal (or masked) self-attention is employed. It restricts the model's attention so that the query at position i can attend only to the keys at positions 0 through i; attention to any later position (j > i) is masked out. Since the output at position i is used to predict the token at position i+1, each prediction depends only on the tokens that precede it. This is fundamental to maintaining the autoregressive property of generating a sequence one token at a time: the model cannot 'see' into the future.
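As a concrete illustration, here is a minimal single-head sketch (plain NumPy; the function name `causal_attention` and the shapes are illustrative, not taken from the source). Future positions receive a score of negative infinity before the softmax, so they end up with exactly zero weight:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head).

    Q, K, V have shape (seq_len, d_k). The query at position i may
    attend only to keys at positions j <= i.
    """
    seq_len, d_k = Q.shape
    # Un-normalized attention scores, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask future positions (j > i) with -inf so softmax gives them weight 0.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, stabilized by subtracting each row's maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of value vectors over visible positions.
    return weights @ V

# Illustrative usage with random inputs.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = causal_attention(Q, K, V)
```

Masking with negative infinity before the softmax, rather than zeroing the weights afterwards, keeps each row of attention weights properly normalized over the visible positions.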

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Scaled Dot-Product Attention
A model is processing a sequence of three tokens. For the query at position 2, the un-normalized attention scores with respect to the keys at positions 0, 1, and 2 are calculated as [1.0, 2.0, 3.0] respectively. What is the final attention weight that the token at position 2 will assign to the token at position 1? (See the first worked sketch after this list.)
Attention Output as a Weighted Sum of Values
Impact of Masking on Attention Weight Distribution
True or False: In a self-attention mechanism, if you add the same constant value to all un-normalized attention scores corresponding to a single query vector, the final normalized attention weights for that query will change. (See the second sketch after this list.)
Attention Weight Formula
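As a check on the attention-weight question above: the un-normalized scores are passed through a softmax, so the weight assigned to position 1 is e^2 / (e^1 + e^2 + e^3) ≈ 0.245. A minimal sketch (plain NumPy, with an illustrative `softmax` helper):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting weights.
    e = np.exp(x - x.max())
    return e / e.sum()

# Un-normalized scores for the query at position 2 against keys 0, 1, 2.
scores = np.array([1.0, 2.0, 3.0])
weights = softmax(scores)
print(weights)     # ~[0.090, 0.245, 0.665]
print(weights[1])  # weight on position 1 ≈ 0.245
```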
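The true/false question above follows from the same property: adding a constant c to every score multiplies each exponential by e^c, and that common factor cancels in the normalization, so the weights do not change (the statement is false). A quick check:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
# The common factor e^5 cancels during normalization, leaving the
# weights identical.
print(np.allclose(softmax(scores), softmax(scores + 5.0)))  # True
```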
Learn After
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap correctly illustrates the attention pattern required for this type of sequential generation model to function correctly? (A sketch of the required pattern appears after this list.)
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
Debugging a Generative Language Model
Example of Causal Attention Dot Products
Choosing the Right Attention Mechanism
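For the heatmap question above: sequential generation requires the lower-triangular pattern, in which each query row i is dark only at key columns j ≤ i; given the image description, this corresponds to heatmap A. A minimal sketch of the allowed-attention pattern (plain NumPy, 4 tokens):

```python
import numpy as np

# 1 marks an allowed (query i, key j) pair; causal attention allows j <= i.
allowed = np.tril(np.ones((4, 4), dtype=int))
print(allowed)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Heatmap B would let every token attend to future tokens, C attends only to the present and future, and D would let each token attend only to itself.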