Concept

Causal Self-Attention in Autoregressive Decoders

In autoregressive decoders, a restricted form of self-attention called causal (or masked) self-attention is employed. The mechanism masks the model's attention so that the prediction for the token at position i can depend only on the tokens at earlier positions (0 through i-1); attending to tokens at or after position i is disallowed. In practice, this is implemented by setting the attention scores of masked positions to negative infinity (or a large negative value) before the softmax, so those positions receive zero weight. This is fundamental to preserving the autoregressive property of generating a sequence one token at a time: the model can never 'see' into the future.
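To make the masking concrete, here is a minimal NumPy sketch of single-head causal self-attention. The function name and the projection matrices w_q, w_k, w_v are illustrative assumptions for this sketch, not taken from any particular library:

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    # Project the inputs to queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    seq_len, d_k = q.shape

    # Scaled dot-product attention scores: (seq_len, seq_len).
    scores = q @ k.T / np.sqrt(d_k)

    # Causal mask: entry (i, j) is masked when j > i, i.e. when
    # position i would attend to a future position j.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; the -inf scores become exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: the output at position i depends only on inputs 0..i.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # 4 tokens, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)   # shape (4, 8)
```

Because the upper-triangular scores are set to -inf before the softmax, perturbing x[3] leaves out[0], out[1], and out[2] unchanged, which is exactly the property the mask enforces.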

Tags

Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences