Formula

Self-Attention Formula for the Prefilling Phase

During the prefilling phase, self-attention is computed for the entire input sequence of $m+1$ tokens in a single operation. The query, key, and value vectors are stacked into matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{(m+1) \times d}$. The attention output is given by the scaled dot-product formula:

$$\text{Att}_{\text{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\text{T}}}{\sqrt{d}} + \text{Mask}\right)\mathbf{V}$$

Here, the causal mask $\text{Mask} \in \mathbb{R}^{(m+1) \times (m+1)}$ prevents tokens from attending to future positions: the corresponding entries of the attention score matrix are set to a very large negative number (effectively $-\infty$) before the Softmax is applied.
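As a concrete illustration, here is a minimal NumPy sketch (not from the course material; function names and shapes are illustrative assumptions) that computes the prefill attention output exactly as in the formula above, with the causal mask built by placing $-\infty$ in the strictly upper-triangular entries of the score matrix:

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """-inf strictly above the diagonal, 0 elsewhere (shape (seq_len, seq_len))."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prefill_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention over the whole prompt with a causal mask.

    Q, K, V have shape (m + 1, d): one row per prompt token.
    """
    m_plus_1, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)              # (m+1, m+1) attention scores
    scores = scores + causal_mask(m_plus_1)    # block attention to future positions
    return softmax(scores, axis=-1) @ V        # (m+1, d) attention output


# Usage example with made-up sizes: a prompt of 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = prefill_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because every row of the masked score matrix keeps at least its diagonal entry finite, the row-wise Softmax is well defined, and each output row depends only on the current and earlier tokens.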


