Learn Before
Concept
Self-Attention Masking Variable Matrix
In the masked Query-Key-Value (QKV) attention formula, a masking variable (a matrix) is added to the scaled dot-product of the queries and keys before the softmax operation. This matrix ensures that each token attends only to itself and to the tokens that precede it in the sequence. Specifically, to mask out future tokens, the entry M(i, j) of the mask corresponding to query position i and key position j is set to −∞ if j > i, and to 0 otherwise; the −∞ entries become zero attention weights after the softmax.
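A minimal NumPy sketch of this causal masking, assuming single-head attention over one sequence (the function names here are illustrative, not from the text):

```python
import numpy as np

def softmax(x):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (seq_len, seq_len)
    n = scores.shape[0]
    # Mask M: 0 where key j <= query i, -inf where j > i (future tokens).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    M = np.where(future, -np.inf, 0.0)
    weights = softmax(scores + M)          # upper triangle becomes exactly 0
    return weights @ V
```

Because the first query can attend only to the first key, its output is exactly `V[0]`; more generally, the attention-weight matrix is lower triangular with rows summing to 1.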
Updated 2026-05-03
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models