Learn Before
Multiple Choice

In a self-attention mechanism processing a sequence of 4 tokens, a mask is added to the raw attention scores to prevent any token from attending to subsequent (future) tokens. Which of the following 4x4 matrices correctly represents this mask?

[Answer choices: candidate 4×4 mask matrices; only stray entries (0, 1) survived extraction]
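For intuition, here is a minimal sketch of the mask the question describes, assuming NumPy and the common additive convention in which allowed positions hold 0 and future positions hold -inf (the helper names causal_mask and softmax are illustrative, not part of the original question):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive look-ahead mask: 0 where attention is allowed
    (the token itself and earlier positions), -inf strictly above
    the diagonal so future positions get zero weight after softmax."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row max for numerical stability; exp(-inf) == 0.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)               # raw attention scores for 4 tokens
weights = softmax(scores + causal_mask(4))   # the mask is *added* to the scores

print(causal_mask(4))    # the 4x4 matrix the question asks about
print(weights.round(2))  # upper triangle is exactly 0; each row sums to 1
```

In practice, implementations often substitute a large negative constant such as -1e9 for -inf so the masked logits stay finite, but the effect after softmax is the same: every token attends only to itself and earlier positions.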


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models Course

Application in Bloom's Taxonomy