
Masked Self-Attention in Transformer Decoders

Masked self-attention is a crucial component of the Transformer decoder, enabling autoregressive text generation. Unlike standard self-attention, it prevents each position from attending to subsequent, or 'future', positions in the sequence. This is implemented by applying a mask to the attention scores before the softmax: the scores for future positions are set to negative infinity (or an equivalently large negative value), so that their attention weights become zero after the softmax. Consequently, the query for a given token can only interact with keys from its own position and all preceding positions, ensuring that the prediction for the current step depends only on the known past.
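As a concrete illustration, the following is a minimal single-head sketch in PyTorch under assumed tensor shapes; the function name masked_self_attention, the projection matrices, and the example dimensions are illustrative assumptions, not part of the original concept.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head masked (causal) self-attention over one sequence.

    x: (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project to queries, keys, values
    d_head = q.size(-1)
    scores = q @ k.T / d_head ** 0.5                  # raw attention scores, (seq_len, seq_len)

    # Causal mask: position i may attend only to positions <= i.
    seq_len = x.size(0)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))  # future positions get -inf scores

    weights = F.softmax(scores, dim=-1)               # -inf scores become zero weights
    return weights @ v                                # weighted sum of value vectors

# Illustrative usage with random embeddings (assumed, arbitrary sizes)
torch.manual_seed(0)
d_model, d_head, seq_len = 8, 4, 5
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = masked_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 4])
```

In this sketch the mask is applied to the scores rather than the output, so the softmax itself renormalizes the remaining (past and current) positions; this is the standard way causal masking is realized in decoder-style attention.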

Updated 2026-04-19

Tags: Data Science, Ch.5 Inference - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Ch.2 Generative Models - Foundations of Large Language Models