Learn Before
In a Transformer decoder, masked self-attention ensures that the prediction for a token at a given position can depend only on that position and the positions before it. This is achieved by modifying the attention score matrix before the softmax function is applied. For a sequence of tokens, which of the following correctly describes the structure of the attention score matrix after this causal mask has been applied?
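For reference, the masked score matrix is effectively lower-triangular: every entry above the main diagonal is replaced with negative infinity, so the row-wise softmax assigns those future positions exactly zero weight. A minimal NumPy sketch of this step (variable names such as `scores` and `seq_len` are illustrative assumptions, not code from the course):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # raw query-key dot products

# Causal mask: every entry above the diagonal (a future position) becomes -inf.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax: -inf entries receive exactly zero weight, so each row i
# distributes attention only over positions j <= i.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))  # zero above the diagonal, each row sums to 1
```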
Tags
Data Science
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive model is generating a sequence of text token by token. When it is time to predict the token at position 't', the model's attention mechanism calculates relevance scores between the query at position 't' and the keys at all positions in the sequence. However, a crucial modification is applied that prevents the query at 't' from incorporating information from any keys at positions greater than 't' (i.e., t+1, t+2, etc.). Which statement best analyzes the fundamental reason for this specific modification?
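The reason hinges on causality: at generation time the tokens after position 't' do not yet exist, so training must prevent the model from conditioning on them, or the training and inference conditions would diverge. A hedged NumPy sketch (the helper `causal_attention` is hypothetical, not from the book) demonstrating that the masked output at a position is unaffected by later tokens:

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask (illustrative sketch)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out = causal_attention(q, k, v)

# Replace the keys/values at the *future* positions 3 and 4; the outputs at
# positions 0-2 are identical, because the mask makes position t independent
# of everything after t.
k2, v2 = k.copy(), v.copy()
k2[3:] = rng.normal(size=(2, 8))
v2[3:] = rng.normal(size=(2, 8))
out2 = causal_attention(q, k2, v2)
print(np.allclose(out[:3], out2[:3]))  # True
```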
A Transformer decoder is calculating its output for a specific token in a sequence. To ensure it only uses information from that token and previous tokens, it employs a special attention mechanism. Arrange the following five operations in the correct chronological order as they would occur within this mechanism.
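For reference, one conventional ordering of these steps in masked scaled dot-product self-attention, sketched in NumPy (the question's five operations may be phrased differently; variable names here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 8
Q, K, V = (rng.normal(size=(6, d_k)) for _ in range(3))

scores = Q @ K.T                       # 1. query-key dot products
scores /= np.sqrt(d_k)                 # 2. scale by sqrt(d_k)
scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # 3. apply causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # 4. row-wise softmax
output = weights @ V                   # 5. weighted sum of value vectors
print(output.shape)                    # (6, 8)
```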