1Cademy - Individual Attention Head Computation (General Vector Form)

Learn Before

Query, Key, and Value Projections in Multi-Head Attention

Formula

Individual Attention Head Computation (General Vector Form)

In the general vector-level formulation of multi-head attention (Eq. 11.5.1), the $i$-th attention head output $\mathbf{h}_i$ (for $i = 1, \ldots, h$) is computed by first projecting a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$ through head-specific learnable weight matrices, and then applying an attention pooling function $f$:

$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q},\; \mathbf{W}_i^{(k)} \mathbf{k},\; \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v}$

Here, $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameter matrices that project the original representations into subspaces of dimensions $p_q$, $p_k$, and $p_v$, respectively. The function $f$ denotes the attention pooling operation, such as additive attention or scaled dot-product attention.
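The computation of a single head can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes $f$ is scaled dot-product attention (which requires $p_q = p_k$), that the head attends over a small set of `m` key-value pairs, and that the weights are random rather than learned. The function names `scaled_dot_product_pooling` and `attention_head`, and all dimension values, are hypothetical choices made for this example.

```python
import numpy as np

def scaled_dot_product_pooling(q, K, V):
    """Assumed choice of f: q is (p,), K is (m, p), V is (m, p_v)."""
    scores = K @ q / np.sqrt(q.shape[0])   # one score per key, shape (m,)
    scores = scores - scores.max()         # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()      # softmax over the m keys
    return weights @ V                     # weighted sum of values, shape (p_v,)

def attention_head(q, keys, values, W_q, W_k, W_v):
    """h_i = f(W_q q, W_k k, W_v v), with k and v ranging over m pairs."""
    q_proj = W_q @ q          # (p_q,)
    K_proj = keys @ W_k.T     # (m, p_k): each row is W_k k
    V_proj = values @ W_v.T   # (m, p_v): each row is W_v v
    return scaled_dot_product_pooling(q_proj, K_proj, V_proj)

# Illustrative dimensions (assumptions, not taken from the source).
rng = np.random.default_rng(0)
d_q, d_k, d_v = 6, 5, 7
p_q = p_k = 4                 # dot-product scoring needs p_q == p_k
p_v = 3
m = 10                        # number of key-value pairs

W_q = rng.normal(size=(p_q, d_q))
W_k = rng.normal(size=(p_k, d_k))
W_v = rng.normal(size=(p_v, d_v))

q = rng.normal(size=d_q)
keys = rng.normal(size=(m, d_k))
values = rng.normal(size=(m, d_v))

h_i = attention_head(q, keys, values, W_q, W_k, W_v)
print(h_i.shape)  # (3,), i.e. a vector in R^{p_v}
```

Because the attention weights are non-negative and sum to one, the head output is a convex combination of the projected values, which is why it lives in $\mathbb{R}^{p_v}$ regardless of $d_q$, $d_k$, or $d_v$.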

Updated 2026-05-14
