Learn Before
Consequences of Removing the Causal Mask
In the context of the self-attention formula used during the prefilling phase, Att(Q, K, V) = Softmax((QK^T / sqrt(d)) + Mask)V, what would be the direct consequence for the model's information flow if the Mask term were omitted (i.e., treated as a matrix of all zeros)? Explain why this outcome is fundamentally incompatible with the goal of training an auto-regressive language model.
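Not part of the original card, but a minimal NumPy sketch may help ground the question: it evaluates Softmax((QK^T / sqrt(d)) + Mask)V once with a causal mask and once with the mask replaced by all zeros. The sequence length, dimensions, and variable names are illustrative assumptions, not anything specified by the course.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask):
    # Scaled dot-product attention: Softmax((Q K^T / sqrt(d)) + Mask) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) pairwise interaction scores
    weights = softmax(scores + mask)     # each row is a distribution over positions
    return weights @ V, weights

n, d = 4, 8                              # toy sequence length and head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Causal mask: 0 on and below the diagonal, -inf above it,
# so position i can only attend to positions j <= i.
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)
zero_mask = np.zeros((n, n))             # the "Mask omitted" case from the question

_, w_causal = attention(Q, K, V, causal_mask)
_, w_none = attention(Q, K, V, zero_mask)

print(np.round(w_causal, 2))  # lower-triangular weights: no attention to future tokens
print(np.round(w_none, 2))    # every row places nonzero weight on future tokens
```

With the all-zeros mask, every position's output mixes in information from later tokens, which is exactly the information flow the question asks you to reason about. Additive masking with -inf (rather than zeroing weights after the softmax) keeps each row a proper probability distribution over only the allowed positions.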
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
The scaled dot-product attention formula, Softmax((QK^T / sqrt(d)) + Mask)V, is used when an entire input sequence is available for simultaneous processing. Which specific operation within this formula directly represents the parallel computation of interaction scores between every possible pair of tokens in the sequence, a step that is only feasible because the entire input is present at once?
Optimizing Prefilling Phase Performance