Learn Before
Masked Self-Attention in Transformer Decoders
Masked self-attention is a crucial component of the Transformer decoder, enabling autoregressive text generation. Unlike standard self-attention, it prevents each position from attending to subsequent, or 'future,' positions in the sequence. This is implemented by applying a causal mask to the attention scores before the softmax function: the scores for future positions are set to negative infinity (or a very large negative value), so the softmax assigns them a weight of zero. Consequently, the query for a given token can only interact with keys from its own position and all preceding positions, ensuring that the prediction for the current step depends only on the known past.
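
The idea can be illustrated with a minimal, single-head sketch in NumPy (the function name masked_self_attention, the toy shapes, and the random inputs below are illustrative assumptions, not part of the course material): scores above the diagonal are set to negative infinity before the softmax, so each row's weights cover only the current and earlier positions.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_k). Position i may only attend
    to positions 0..i, so step i never incorporates future tokens.
    """
    seq_len, d_k = Q.shape
    # Raw attention scores between every query and every key.
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # Causal mask: True above the diagonal marks "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # Setting future scores to -inf makes their softmax weight exactly 0.
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax over the allowed (current and preceding) positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (seq_len, d_k)

# Tiny usage example: random projections of a 4-token sequence.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = masked_self_attention(Q, K, V)
```

Printing the weights inside this sketch would show a lower-triangular matrix: row t has nonzero entries only for positions 0 through t, which is exactly the constraint the prose describes.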

Tags
Data Science
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Core Components of a Transformer Decoding Network
Masked Self-Attention in Transformer Decoders
A developer is building a model designed to generate text sequentially, where each new word is predicted based on the words that came before it. They consider modifying the model by removing the specific constraint that prevents a position in the sequence from attending to subsequent positions. What is the most likely consequence of this change on the model's training and generation capabilities?
A standard Transformer decoder block contains two distinct attention sub-layers. Which statement accurately differentiates the roles and data sources for these two sub-layers?
Within a single decoder block of a standard Transformer architecture, information is processed through three main computational sub-layers. Arrange these sub-layers in the correct operational sequence.
Learn After
An autoregressive model is generating a sequence of text token by token. When it is time to predict the token at position 't', the model's attention mechanism is designed to calculate relevance scores between the query at position 't' and the keys at all other positions in the sequence. However, a crucial modification is applied that prevents the query at 't' from incorporating information from any keys at positions greater than 't' (i.e., t+1, t+2, etc.). Which statement best analyzes the fundamental reason for this specific modification?
In a Transformer decoder, masked self-attention is used to ensure that the prediction for a token at a given position can only depend on previous tokens. This is achieved by modifying the attention score matrix before the softmax function is applied. For a sequence of tokens, which of the following correctly describes the structure of the attention score matrix after this causal mask has been applied?
A Transformer decoder is calculating its output for a specific token in a sequence. To ensure it only uses information from that token and previous tokens, it employs a special attention mechanism. Arrange the following five operations in the correct chronological order as they would occur within this mechanism.