Masked QKV Attention Formula

In a self-attention sub-layer, the computation is generally expressed as Query-Key-Value (QKV) attention. When incorporating a masking variable $\mathbf{Mask}$ to ensure the model only considers previous tokens during prediction, the attention is calculated using the formula

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}} + \mathbf{Mask}\right)\mathbf{V}$$

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V} \in \mathbb{R}^{m \times d}$ are the query, key, and value matrices, respectively. Here $\mathbf{Mask} \in \mathbb{R}^{m \times m}$ assigns $-\infty$ to entries corresponding to future positions and $0$ elsewhere, so the attention weights on future tokens become zero after the Softmax.
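As a concrete illustration, here is a minimal NumPy sketch of this formula. The function name masked_qkv_attention, the single-head setup, and the random inputs are illustrative assumptions, not from the source; the sketch builds a causal mask with $-\infty$ above the diagonal, adds it to the scaled scores, and applies a row-wise Softmax before weighting $\mathbf{V}$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating;
    # exp(-inf) evaluates to 0, so masked positions get zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_qkv_attention(Q, K, V):
    """Att_qkv(Q, K, V) = Softmax(Q K^T / sqrt(d) + Mask) V.

    Q, K, V: arrays of shape (m, d). The causal Mask places -inf above
    the diagonal so each position attends only to previous tokens.
    """
    m, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # (m, m) scaled dot products
    mask = np.triu(np.full((m, m), -np.inf), k=1)  # -inf for future positions, 0 elsewhere
    weights = softmax(scores + mask, axis=-1)      # each row sums to 1
    return weights @ V                             # (m, d)

# Usage example with hypothetical values: m = 4 tokens, d = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = masked_qkv_attention(Q, K, V)
print(out.shape)  # (4, 8)
```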


Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
