Learn Before
Rationale for Causal Mask Values
In a self-attention mechanism designed for sequential data processing (like generating text), a mask matrix is added to the raw attention scores before a normalization step. This matrix uses values of 0 for positions a token is allowed to attend to, and negative infinity (-∞) for positions it is forbidden from attending to. Explain precisely why negative infinity is used for the forbidden positions and what effect this has on the final, normalized attention weights.
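To see the effect concretely, here is a minimal sketch (assuming NumPy; the 5-position score vector is illustrative, not from the course material). Adding −∞ before the softmax drives exp(−∞) to exactly 0, so forbidden positions receive zero attention weight while the allowed positions still normalize to sum to 1.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)      # shift for numerical stability
    e = np.exp(z)          # np.exp(-np.inf) evaluates to exactly 0.0
    return e / e.sum()

# Toy raw scores for one query token attending to 5 positions (illustrative).
scores = np.array([2.0, 1.0, 0.5, 3.0, -1.0])

# Causal mask for the 3rd token: 0 for positions 1-3 (allowed),
# -inf for positions 4-5 (future, forbidden).
mask = np.array([0.0, 0.0, 0.0, -np.inf, -np.inf])

weights = softmax(scores + mask)
print(weights)        # approx [0.63, 0.23, 0.14, 0.  , 0.  ]
print(weights.sum())  # 1.0 -- forbidden positions get exactly zero weight
```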
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a self-attention mechanism designed for autoregressive tasks, a sequence of 5 tokens is processed. The mechanism computes raw attention scores for each token relative to all other tokens. Before a final normalization step, a mask is added to these scores to prevent any token from attending to future tokens. For the 3rd token in the sequence, which vector correctly represents its scores for all 5 tokens after this causal mask has been applied? (Let s_i denote the original raw score for the 3rd token attending to the i-th token.)
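A worked sketch of the masked vector, assuming the 0 / −∞ convention from the question above: the 3rd token may attend to positions 1 through 3 and is blocked from positions 4 and 5, giving

$$\left[\, s_1,\; s_2,\; s_3,\; -\infty,\; -\infty \,\right]$$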
In a self-attention mechanism processing a sequence of 4 tokens, a mask is added to the raw attention scores to prevent any token from attending to subsequent (future) tokens. Which of the following 4x4 matrices correctly represents this mask?
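For reference, a minimal sketch (assuming NumPy; np.triu with k=1 keeps only the entries strictly above the diagonal) that constructs such a 4x4 causal mask, with 0 for allowed positions and −∞ for future positions:

```python
import numpy as np

n = 4
# Start from an all--inf matrix, then keep -inf only strictly above the
# diagonal (future positions); everything else becomes 0 (allowed).
mask = np.triu(np.full((n, n), -np.inf), k=1)
print(mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```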