1Cademy - Multi-Head Attention Output Formula (General Vector Form)

In a more general vector-level formulation of multi-head attention (Eq. 11.5.2), the layer output is obtained by stacking the $h$ individual head outputs $\mathbf{h}_1, \ldots, \mathbf{h}_h$, each lying in $\mathbb{R}^{p_v}$, into a single concatenated vector of dimensionality $hp_v$, and then multiplying that vector by a learnable output projection matrix $\mathbf{W}_o \in \mathbb{R}^{p_o \times hp_v}$:

$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}$

Unlike the matrix-level formulation that fixes the output projection to $\mathbb{R}^{d \times d}$, this parameterization allows the output dimensionality $p_o$ to differ from both the input dimensionality and the per-head value dimensionality $p_v$, providing additional architectural flexibility.
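
The shape bookkeeping in the formula can be checked with a small sketch. This is a minimal illustration in plain Python lists, not a reference implementation: the function name `multi_head_output`, the toy sizes ($h = 2$, $p_v = 3$, $p_o = 2$), and the hand-picked weights are all assumptions made for the example. In practice $\mathbf{W}_o$ would be learned, and the head outputs would come from the attention computation itself.

```python
def multi_head_output(head_outputs, W_o):
    """Concatenate per-head vectors and apply the output projection W_o.

    head_outputs: list of h vectors, each of length p_v
    W_o:          list of p_o rows, each of length h * p_v
    """
    # Stack h_1, ..., h_h into one vector of length h * p_v.
    concat = [x for head in head_outputs for x in head]
    # Every row of W_o must match the concatenated length.
    if any(len(row) != len(concat) for row in W_o):
        raise ValueError("W_o must have shape (p_o, h * p_v)")
    # Matrix-vector product: one entry per row of W_o, so p_o entries.
    return [sum(w * x for w, x in zip(row, concat)) for row in W_o]


# Hypothetical toy sizes: h = 2 heads, p_v = 3, p_o = 2.
heads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
W_o = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # first entry of h_1 plus last entry of h_2
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # half the sum of all concatenated entries
]
output = multi_head_output(heads, W_o)
print(output)  # [7.0, 10.5]
```

Here the concatenated vector has $hp_v = 6$ entries while the output has only $p_o = 2$, which shows that $p_o$ is set by the number of rows of $\mathbf{W}_o$ and is independent of $p_v$.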

Updated 2026-05-14
