Concept

Self-Attention Masking Variable Matrix

In the masked Query-Key-Value (QKV) attention formula, a masking variable $\mathbf{Mask} \in \mathbb{R}^{(m+1) \times (m+1)}$ is added to the scaled dot-product of queries and keys before the softmax operation. This matrix ensures that each token attends only to itself and to the tokens that precede it in the sequence. Specifically, to mask out future tokens, the entry $(i, j)$ of the mask, corresponding to the query $\mathbf{q}_i$ and the key $\mathbf{k}_j$, is set to $-\infty$ if $i < j$ and to $0$ otherwise, so the softmax assigns zero attention weight to every future position.
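
To make this concrete, here is a minimal NumPy sketch of causal masked attention under the definition above. The function name `masked_attention` and the toy shapes are illustrative assumptions, not from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (m + 1, d), one row per token.
    """
    n, d = Q.shape  # n = m + 1 tokens
    # Mask[i, j] = -inf where i < j (future tokens), 0 elsewhere.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = Q @ K.T / np.sqrt(d) + mask
    return softmax(scores) @ V

# Hypothetical example: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = masked_attention(Q, K, V)
# Row i of the attention weights is exactly zero for all columns j > i.
```

Adding $-\infty$ before the softmax, rather than zeroing weights afterward, keeps each row of the attention matrix a proper probability distribution over only the visible positions.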

Updated 2026-05-03

Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models