Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism, the key weight sub-matrix for an individual attention head, denoted as $W^K_i$, has a shape of $d \times d/M$. This formula applies specifically when the total dimension of the key projection across all heads equals the input representation dimension, $d$. In this context, $M$ represents the number of attention heads.
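A minimal NumPy sketch of this shape rule (the names d, M, d_k and the value 512 are illustrative assumptions, not part of the card):

```python
import numpy as np

d = 512       # input representation dimension (illustrative)
M = 8         # number of attention heads (illustrative)
d_k = d // M  # per-head key dimension: 512 / 8 = 64

# One key weight sub-matrix per head, each of shape (d, d/M).
W_k_heads = [np.random.randn(d, d_k) for _ in range(M)]

x = np.random.randn(d)             # a single input vector
keys = [x @ W for W in W_k_heads]  # one (d/M,)-dimensional key per head

# Concatenating the per-head keys recovers the full key dimension d.
combined = np.concatenate(keys)
assert combined.shape == (d,)
```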
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Individual Attention Head Formula
Shape of Key Weight Matrix per Head
In a multi-head attention mechanism with 'M' heads, an engineer makes an implementation error: instead of creating a unique set of learnable query, key, and value weight matrices for each of the 'M' heads, a single shared set is used across all heads. What is the primary consequence of this error on the model's functionality? (A toy demonstration follows this list.)
Rationale for Unique Projections in Multi-Head Attention
Attention Head Specialization
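As a hedged illustration of the shared-weights question above (the sizes, names, and softmax helper are invented for the sketch; the card does not supply an official answer), sharing one query/key/value projection triple makes every head compute an identical output, so the M heads collapse into one head repeated M times:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, M, n = 512, 8, 4          # model dim, heads, sequence length (illustrative)
d_h = d // M

X = rng.normal(size=(n, d))  # a toy input sequence

# The implementation error: one (d, d/M) projection triple reused by all heads.
Wq = rng.normal(size=(d, d_h))
Wk = rng.normal(size=(d, d_h))
Wv = rng.normal(size=(d, d_h))

def head_output(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_h))  # scaled dot-product attention
    return A @ V

outputs = [head_output(X, Wq, Wk, Wv) for _ in range(M)]

# Every head produces exactly the same matrix: no head specialization remains.
for out in outputs[1:]:
    assert np.allclose(out, outputs[0])
```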
Learn After
In a neural network component, an input representation of dimension 512 is processed by 8 parallel 'heads'. For each head, a 'key' vector is produced by multiplying the input representation by a head-specific weight matrix. The key vectors from all heads are concatenated, resulting in a final combined dimension of 512. What is the shape of the weight matrix used to produce the key vector for a single head? (A worked computation follows this list.)
Determining the Number of Attention Heads
Debugging a Multi-Head Attention Layer
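A worked reading of the 512-dimension practice question above, reusing the symbols from the definition at the top of this card (the notation $W^K_i$ is our own choice):

```latex
% d = 512 (input dimension), M = 8 (heads), concatenated key dimension = 512
\[
d_k = \frac{d}{M} = \frac{512}{8} = 64
\qquad\Rightarrow\qquad
W^K_i \in \mathbb{R}^{d \times d/M} = \mathbb{R}^{512 \times 64}
\]
```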