Concept

Self-Attention Masking Variable Matrix

In the masked Query-Key-Value (QKV) attention formula, a masking variable $\mathbf{Mask} \in \mathbb{R}^{(m+1) \times (m+1)}$ is added to the scaled dot-product of queries and keys before the softmax operation. This matrix ensures that each token attends only to itself and to the tokens that precede it in the sequence. Specifically, to mask out future tokens, the entry $(i, j)$ of the mask, corresponding to the query $\mathbf{q}_i$ and the key $\mathbf{k}_j$, is set to $-\infty$ if $i < j$ and to $0$ otherwise, so the softmax assigns zero attention weight to every future position.
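
To make this concrete, here is a minimal NumPy sketch of causal masked attention under the definition above. The function name `masked_attention` and the toy shapes are illustrative assumptions, not from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (m + 1, d), one row per token.
    """
    n, d = Q.shape  # n = m + 1 tokens
    # Mask[i, j] = -inf where i < j (future tokens), 0 elsewhere.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = Q @ K.T / np.sqrt(d) + mask
    return softmax(scores) @ V

# Hypothetical example: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = masked_attention(Q, K, V)
# Row i of the attention weights is exactly zero for all columns j > i.
```

Adding $-\infty$ before the softmax, rather than zeroing weights afterward, keeps each row of the attention matrix a proper probability distribution over only the visible positions.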

Updated 2026-05-03

Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models