Scaled Dot-Product Attention

Formula

Scaled Dot-Product Attention is a core component of Transformer models and a specific implementation of the Query-Key-Value (QKV) attention paradigm. Its operation is defined by the formula:

$$\text{Att}_{\text{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\text{T}}}{\sqrt{d}} + \text{Mask}\right)\mathbf{V}$$

In this formula:

  • $\mathbf{Q}$ (Queries), $\mathbf{K}$ (Keys), and $\mathbf{V}$ (Values) are the input matrices, where $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{m \times d}$.
  • The attention scores are calculated via the dot product of the Query and transposed Key matrices ($\mathbf{Q}\mathbf{K}^{\text{T}}$).
  • These scores are scaled by the square root of the key vector dimension, $\sqrt{d}$, to maintain stable gradients during training.
  • An optional Mask matrix is added to the scaled scores. This is crucial in settings where attention should be restricted, such as preventing a position from attending to subsequent positions in autoregressive tasks.
  • The Softmax function normalizes the scores into attention weights (probabilities).
  • The final output is a weighted sum of the Value vectors based on these weights; a minimal code sketch of these steps follows the list.
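The sketch below walks through the same steps in NumPy: score computation, scaling, optional masking, Softmax, and the weighted sum of values. The function and variable names (e.g., `scaled_dot_product_attention`, `causal_mask`) are illustrative assumptions, not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute Softmax(QK^T / sqrt(d) + Mask) V for Q, K, V of shape (m, d)."""
    d = Q.shape[-1]
    # Attention scores: dot product of every query with every key, scaled by sqrt(d).
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        # Disallowed positions hold -inf, so Softmax assigns them near-zero weight.
        scores = scores + mask
    weights = softmax(scores, axis=-1)  # attention weights; each row sums to 1
    return weights @ V                  # weighted sum of the value vectors

# Usage: a causal (autoregressive) mask that blocks attention to future positions.
m, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((m, d)) for _ in range(3))
causal_mask = np.triu(np.full((m, m), -np.inf), k=1)  # -inf strictly above the diagonal
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # (4, 8)
```

Under this construction, row $i$ of the mask zeroes out the weights on positions $j > i$, which is the restriction described in the Mask bullet above.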