Masked QKV Attention Formula

In a self-attention sub-layer, the computation is generally expressed as Query-Key-Value (QKV) attention. When incorporating a masking variable $\mathbf{Mask}$ to ensure the model only considers previous tokens during prediction, the attention is calculated using the formula

$$\mathrm{Att}_{\mathrm{qkv}}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{d}} + \mathbf{Mask}\right)\mathbf{V}$$

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V} \in \mathbb{R}^{m \times d}$ are the query, key, and value matrices, respectively. Here $\mathbf{Mask} \in \mathbb{R}^{m \times m}$ assigns $-\infty$ to entries corresponding to future positions and $0$ elsewhere, so the attention weights on future tokens become zero after the Softmax.
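As a concrete illustration, here is a minimal NumPy sketch of this formula. The function name masked_qkv_attention, the single-head setup, and the random inputs are illustrative assumptions, not from the source; the sketch builds a causal mask with $-\infty$ above the diagonal, adds it to the scaled scores, and applies a row-wise Softmax before weighting $\mathbf{V}$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating;
    # exp(-inf) evaluates to 0, so masked positions get zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_qkv_attention(Q, K, V):
    """Att_qkv(Q, K, V) = Softmax(Q K^T / sqrt(d) + Mask) V.

    Q, K, V: arrays of shape (m, d). The causal Mask places -inf above
    the diagonal so each position attends only to previous tokens.
    """
    m, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # (m, m) scaled dot products
    mask = np.triu(np.full((m, m), -np.inf), k=1)  # -inf for future positions, 0 elsewhere
    weights = softmax(scores + mask, axis=-1)      # each row sums to 1
    return weights @ V                             # (m, d)

# Usage example with hypothetical values: m = 4 tokens, d = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = masked_qkv_attention(Q, K, V)
print(out.shape)  # (4, 8)
```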


Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
