Formula

Scaled Dot-Product Attention

Scaled dot-product attention is a widely used attention scoring mechanism and a core component of Transformer architectures. It operates on a minibatch of $n$ queries and $m$ key-value pairs, where queries and keys share a feature dimension $d$ and values have dimension $v$. The matrix formulation is:

$$\mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}$$

where $\mathbf{Q} \in \mathbb{R}^{n \times d}$, $\mathbf{K} \in \mathbb{R}^{m \times d}$, and $\mathbf{V} \in \mathbb{R}^{m \times v}$. The softmax is applied row-wise, so each query receives a weight distribution over the $m$ keys. Dividing by $\sqrt{d}$ controls the variance of the dot-product scores before softmax normalization: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d$, so the scaled score has variance 1 regardless of the vector length.

In the general case, queries and keys need not have the same vector length; when they differ, the dot product $\mathbf{q}^\top \mathbf{k}$ can be replaced with $\mathbf{q}^\top \mathbf{M} \mathbf{k}$, where $\mathbf{M}$ is a suitably chosen matrix for translating between the two spaces. In practice, minibatch computation is handled via batch matrix multiplication, and dropout is applied to the attention weights for regularization before multiplying with the values.
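To make the matrix formulation concrete, the following is a minimal sketch in plain Python (standard library only) for a single, unbatched example. The function names `softmax` and `scaled_dot_product_attention` and the toy inputs are illustrative assumptions, not part of the source; dropout and batch matrix multiplication are deliberately left out, and a real implementation would rely on a tensor library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of lists.

    Q is n x d, K is m x d, V is m x v.
    Returns (output, weights) with shapes n x v and n x m.
    """
    d = len(Q[0])
    scale = math.sqrt(d)

    # One row of attention weights per query: a distribution over the m keys.
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))

    # Each output row is the weighted average of the value vectors.
    v_dim = len(V[0])
    output = [
        [sum(w * V[j][c] for j, w in enumerate(row)) for c in range(v_dim)]
        for row in weights
    ]
    return output, weights


# Toy inputs (hypothetical): n = 3 queries, m = 2 key-value pairs, d = 2, v = 3.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
```

Here the third query matches both keys equally, so its weights are uniform and its output is the plain average of the two value vectors. The first query aligns with the first key and therefore puts more weight on the first value.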

Updated 2026-05-14

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models

D2L

Dive into Deep Learning @ D2L
