Learn Before
Formula for Single-Head Self-Attention
The formula for single-head self-attention calculates the output for a single query vector based on a set of key-value pairs. The formula is:

$$\mathrm{Attn}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V}$$

In this equation, the value matrix $\mathbf{V}$ is an element of the set of real-valued matrices with dimensions $m \times d_v$, expressed as $\mathbf{V} \in \mathbb{R}^{m \times d_v}$. Here, $m$ represents the number of key-value pairs in the sequence, and $d_v$ is the dimension of each value vector. The overall process involves computing the dot product of the query with all keys, scaling by the square root of the key dimension $d$, applying the Softmax function to obtain attention weights, and finally computing a weighted sum of the value vectors.
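As a concrete illustration, here is a minimal NumPy sketch of the computation described above for a single query. The function and variable names (`single_query_attention`, `q`, `K`, `V`, `d_k`) are illustrative choices, not taken from the source; shapes follow the definitions above.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def single_query_attention(q, K, V):
    """Attend a single query q (shape [d_k]) over m key-value pairs.

    K has shape [m, d_k], V has shape [m, d_v]; the result has shape [d_v].
    """
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)   # dot product of q with all keys, scaled by sqrt(d_k)
    weights = softmax(scores)       # attention weights over the m key-value pairs
    return weights @ V              # weighted sum of the value vectors

# Example (illustrative sizes): m = 4 key-value pairs, key dim 8, value dim 16.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 16))
print(single_query_attention(q, K, V).shape)  # (16,)
```

The output has the value dimension $d_v$, since it is a convex combination of the $m$ value vectors weighted by the Softmax attention weights.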

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Stacked Layer Architecture and Final Output in Transformers
Formula for Single-Head Self-Attention
Within a single layer of a Transformer model during inference, a sequence of input vectors is processed in two steps. Which statement best analyzes the distinct roles of the self-attention mechanism and the subsequent Feed-Forward Network (FFN) in this process?
Arrange the following computational steps in the correct order as they occur within a single layer of a Transformer model during inference.
Debugging a Transformer Layer
Learn After
In a mechanism that calculates attention, scores are computed by taking the dot product of a query vector with a set of key vectors. These scores are then scaled by dividing by the square root of the dimension of the vectors (i.e., by $\sqrt{d}$) before being passed to a Softmax function. What is the most likely adverse consequence of removing this scaling step, particularly when the vector dimension is large?
A computational mechanism is used to determine the relevance of different parts of an input sequence to a specific element. This involves several steps to produce a final output vector. Arrange the following computational steps in the correct chronological order.
Attention Mechanism Output Analysis