Short Answer

Consequences of Removing the Causal Mask

In the context of the self-attention formula used during the prefilling phase, Att(Q, K, V) = Softmax((QK^T / sqrt(d)) + Mask)V, what would be the direct consequence on the model's information flow if the Mask term were omitted (i.e., treated as a matrix of all zeros)? Explain why this outcome is fundamentally incompatible with the goal of training an auto-regressive language model.
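As a way to make the question concrete, the following minimal NumPy sketch compares the attention weights produced with a causal mask against those produced with an all-zero mask. It is an illustration under stated assumptions, not a reference implementation: the function name `attention`, the sequence length `n = 4`, the dimension `d = 8`, and the random inputs are all hypothetical choices, and the causal mask is assumed to take the common form of negative infinity above the diagonal and zero elsewhere.

```python
import numpy as np

def attention(Q, K, V, mask):
    """Att(Q, K, V) = Softmax(QK^T / sqrt(d) + Mask) V, returned with its weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + mask
    # Subtract the row-wise max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Hypothetical toy inputs: 4 positions, 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Assumed causal mask: -inf strictly above the diagonal (future positions), 0 elsewhere.
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)
# "Omitting" the mask is modeled as adding a matrix of zeros.
zero_mask = np.zeros((n, n))

_, w_causal = attention(Q, K, V, causal_mask)
_, w_full = attention(Q, K, V, zero_mask)

# Total attention weight that positions place on tokens that come after them.
future_weight_causal = np.triu(w_causal, k=1).sum()
future_weight_full = np.triu(w_full, k=1).sum()

print("weight on future tokens, causal mask:", future_weight_causal)
print("weight on future tokens, zero mask:  ", future_weight_full)
```

In this sketch, row i of the weight matrix describes which positions token i reads from, so any nonzero entry above the diagonal means a position is drawing on tokens that come later in the sequence.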

Updated 2025-10-08

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science