Learn Before
Multiple Choice

In a self-attention mechanism designed for autoregressive tasks, a sequence of 5 tokens is processed. The mechanism computes raw attention scores for each token relative to all other tokens. Before the final softmax normalization, a causal mask is added to these scores to prevent any token from attending to future tokens. For the 3rd token in the sequence, which vector correctly represents its scores for all 5 tokens after the causal mask has been applied? (Let s_i denote the 3rd token's original raw score for attending to the i-th token.)
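The masking step described above can be sketched in NumPy. This is a minimal illustration, not any particular library's implementation: the score values are made up, and the mask follows the standard convention of adding negative infinity at every future position so that softmax assigns those positions zero weight.

```python
import numpy as np

# Toy raw attention scores for a 5-token sequence (values are illustrative).
scores = np.arange(25, dtype=float).reshape(5, 5)

# Causal mask: entry (i, j) is -inf wherever j > i, so token i cannot
# attend to future tokens; allowed positions get 0 (scores unchanged).
mask = np.where(np.triu(np.ones((5, 5)), k=1) == 1, -np.inf, 0.0)
masked = scores + mask

# Softmax over each row; the -inf entries become exactly 0 attention weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row index 2 is the 3rd token: its first three scores survive,
# while positions 4 and 5 are masked to -inf.
print(masked[2])
print(weights[2])
```

For the 3rd token, the masked row has the pattern (s_1, s_2, s_3, -inf, -inf), which is exactly the situation the question describes.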

Updated 2025-09-26


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science