Formula

Multi-Head Attention Output Formula (General Vector Form)

In a more general vector-level formulation of multi-head attention (Eq. 11.5.2), the final layer output is obtained by stacking the $h$ individual head outputs $\mathbf{h}_1, \ldots, \mathbf{h}_h$, each lying in $\mathbb{R}^{p_v}$, into a single concatenated vector of dimensionality $hp_v$, and then multiplying by a learnable output projection matrix $\mathbf{W}_o \in \mathbb{R}^{p_o \times hp_v}$:

$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}$$

Unlike the matrix-level formulation that fixes the output projection to $\mathbb{R}^{d \times d}$, this parameterization allows the output dimensionality $p_o$ to differ from both the input dimensionality and the per-head value dimensionality $p_v$, providing additional architectural flexibility.
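The concatenate-then-project step can be sketched in a few lines of plain Python. This is a minimal illustration only, not a reference implementation: the function name `multi_head_output`, the sizes `h = 4`, `p_v = 3`, `p_o = 5`, and the random values standing in for head outputs and learned weights are all assumptions made for the example.

```python
import random


def multi_head_output(heads, W_o):
    """Concatenate per-head outputs and apply the output projection W_o.

    heads: list of h vectors, each of length p_v
    W_o:   p_o rows, each of length h * p_v
    """
    # Stack the h head vectors into one vector of length h * p_v.
    concat = [x for head in heads for x in head]
    if any(len(row) != len(concat) for row in W_o):
        raise ValueError("each row of W_o must have length h * p_v")
    # Matrix-vector product: one output entry per row of W_o.
    return [sum(w * x for w, x in zip(row, concat)) for row in W_o]


# Illustrative sizes (assumptions, not taken from the text).
h, p_v, p_o = 4, 3, 5
rng = random.Random(0)
# Random stand-ins for the head outputs and the learned projection.
heads = [[rng.gauss(0, 1) for _ in range(p_v)] for _ in range(h)]
W_o = [[rng.gauss(0, 1) for _ in range(h * p_v)] for _ in range(p_o)]

output = multi_head_output(heads, W_o)
print(len(output))  # 5, i.e. p_o
```

Because `W_o` has `p_o` rows, the output length is set by `p_o` alone, independent of `h` and `p_v`. In practice this would typically be a single batched matrix multiplication in a tensor library rather than nested Python lists.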

Updated 2026-05-14

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

D2L

Dive into Deep Learning @ D2L