1Cademy - Scaled Dot-Product Attention

Learn Before

Formula

Scaled Dot-Product Attention

Scaled dot-product attention is a widely used attention scoring mechanism and a core component of Transformer architectures. It operates on batches of $n$ queries, $m$ key-value pairs, where queries and keys share a feature dimension $d$ and values have dimension $v$ . The matrix formulation is: $\mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}$ where $\mathbf{Q} \in \mathbb{R}^{n \times d}$ , $\mathbf{K} \in \mathbb{R}^{m \times d}$ , and $\mathbf{V} \in \mathbb{R}^{m \times v}$ . The scaling factor $\sqrt{d}$ controls the variance of the dot product scores before softmax normalization. In the general case, queries and keys need not have the same vector length; when they differ, the dot product $\mathbf{q}^\top \mathbf{k}$ can be replaced with $\mathbf{q}^\top \mathbf{M} \mathbf{k}$ , where $\mathbf{M}$ is a suitably chosen matrix for translating between the two spaces. In practice, minibatch computation is handled via batch matrix multiplication, and dropout is applied to the attention weights for regularization before multiplying with the values.

Updated 2026-05-14

Contributors are:

Who are from:

Learn Before

Related

Learn After