Causal Attention Mask Matrix

Definition

In self-attention mechanisms where queries, keys, and values are represented by matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{m \times d}$, a masking variable ensures that each token's prediction depends only on preceding tokens. This is achieved with a mask matrix $\mathrm{Mask} \in \mathbb{R}^{m \times m}$ whose entry at row $i$ and column $k$ is $0$ if $k \le i$ (allowing attention to the current and past positions) and $-\infty$ if $k > i$ (prohibiting attention to future positions). The mask is added to the attention scores before the softmax activation:

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}} + \mathrm{Mask}\right)\mathbf{V}$$
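The definition above can be sketched in NumPy for a single attention head. This is a minimal illustration, not an implementation from the text; the function names `causal_mask` and `masked_attention` are chosen here for clarity:

```python
import numpy as np

def causal_mask(m):
    # Mask[i, k] = 0 for k <= i, -inf for k > i:
    # -inf above the main diagonal, 0 on and below it.
    return np.triu(np.full((m, m), -np.inf), k=1)

def masked_attention(Q, K, V):
    # Scaled dot-product attention with the causal mask
    # added to the scores before the softmax.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + causal_mask(Q.shape[0])
    # Row-wise softmax; -inf entries become zero weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because row 0 of the mask is $-\infty$ everywhere except column 0, the first token's output is exactly its own value vector: no future position contributes.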


Updated 2025-10-07


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models