Formula

Individual Attention Head Computation (General Vector Form)

In the general vector-level formulation of multi-head attention (Eq. 11.5.1), the $i$-th attention head output $\mathbf{h}_i$ (for $i = 1, \ldots, h$) is computed by first projecting a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$ through head-specific learnable weight matrices, and then applying an attention pooling function $f$:

$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q},\; \mathbf{W}_i^{(k)} \mathbf{k},\; \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v}$$

Here, $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameter matrices that project the original representations into subspaces of dimensions $p_q$, $p_k$, and $p_v$, respectively. The function $f$ denotes the attention pooling operation, such as additive attention or scaled dot-product attention.
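The head computation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it picks scaled dot-product attention as the pooling function $f$ (which requires $p_q = p_k$), pools one query over `n` key-value pairs, and uses random weights in place of learned ones. The function names, the dimension values, and the choice of `n` are all hypothetical and chosen only for the example.

```python
import numpy as np


def scaled_dot_product_attention(q, K, V):
    """Attention pooling f: one query q (p,), keys K (n, p), values V (n, p_v)."""
    scores = K @ q / np.sqrt(q.shape[0])      # (n,) scaled similarity scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                        # (p_v,) weighted average of values


def attention_head(q, keys, values, W_q, W_k, W_v, f=scaled_dot_product_attention):
    """Compute h_i = f(W_q q, W_k k, W_v v) for a single head."""
    q_proj = W_q @ q          # (p_q,)
    K_proj = keys @ W_k.T     # (n, p_k): each key projected by W_k
    V_proj = values @ W_v.T   # (n, p_v): each value projected by W_v
    return f(q_proj, K_proj, V_proj)


# Assumed sizes, for illustration only.
rng = np.random.default_rng(0)
d_q, d_k, d_v = 8, 6, 10   # input dimensions of query, key, value
p_q = p_k = 4              # projected query/key dimension (must match for dot product)
p_v = 5                    # projected value dimension
n, h = 3, 2                # number of key-value pairs, number of heads

q = rng.normal(size=d_q)
keys = rng.normal(size=(n, d_k))
values = rng.normal(size=(n, d_v))

# One (W_q, W_k, W_v) triple per head; random stand-ins for learned parameters.
W_q = rng.normal(size=(h, p_q, d_q))
W_k = rng.normal(size=(h, p_k, d_k))
W_v = rng.normal(size=(h, p_v, d_v))

heads = [attention_head(q, keys, values, W_q[i], W_k[i], W_v[i]) for i in range(h)]
```

Each entry of `heads` is a vector of length `p_v`, matching $\mathbf{h}_i \in \mathbb{R}^{p_v}$. Passing a different callable as `f` (for example, an additive attention function) would change the pooling step without touching the projections.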

Updated 2026-05-14

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

D2L

Dive into Deep Learning @ D2L