Multiple Choice

In a Transformer decoder, masked self-attention is used to ensure that the prediction for a token at a given position can only depend on previous tokens. This is achieved by modifying the attention score matrix before the softmax function is applied. For a sequence of tokens, which of the following correctly describes the structure of the attention score matrix after this causal mask has been applied?
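A minimal sketch of how such a causal mask is typically applied (assuming a PyTorch-style implementation; the sequence length and tensor names here are illustrative, not from the question itself):

```python
import torch

# Illustrative sequence length and raw attention scores
# (standing in for Q·K^T / sqrt(d_k)).
seq_len = 4
scores = torch.randn(seq_len, seq_len)

# Boolean mask that is True strictly above the main diagonal,
# i.e. the "future" positions each token must not attend to.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Set future positions to -inf so softmax assigns them zero weight.
masked_scores = scores.masked_fill(mask, float("-inf"))

# Row-wise softmax: row i now distributes attention only over
# positions 0..i, so the weight matrix is lower triangular.
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```

In this sketch, the entries above the diagonal of the score matrix are replaced with −∞ before the softmax, so the resulting attention weights for each position sum to 1 over the current and earlier positions only.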

