Shape of Key Weight Matrix per Head
In a multi-head attention mechanism, the key weight matrix for an individual attention head, which can be denoted as $W_k^{(m)}$, has a specific shape defined as $d \times \frac{d_k}{M}$. In this formula, $d$ is the dimension of the input representation, $d_k$ is the total dimension of the key projection across all heads, and $M$ is the number of attention heads.
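As a quick concrete check, here is a minimal NumPy sketch (illustrative, not from the source; the values d = 512, d_k = 512, and M = 8 are taken from the practice question under Learn After below):

```python
import numpy as np

d = 512    # dimension of the input representation
d_k = 512  # total key-projection dimension across all heads
M = 8      # number of attention heads

# Per-head key weight matrix has shape (d, d_k / M).
W_k_head = np.random.randn(d, d_k // M)
print(W_k_head.shape)  # (512, 64)

# Projecting a batch of 10 input vectors yields 64-dimensional keys.
x = np.random.randn(10, d)
print((x @ W_k_head).shape)  # (10, 64)
```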
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Individual Attention Head Formula
Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism with $M$ heads, an engineer makes an implementation error. Instead of creating a unique set of learnable query, key, and value projection matrices for each of the $M$ heads, the same single set of query, key, and value weight matrices is shared across all heads. What is the primary consequence of this error on the model's functionality? (A sketch illustrating this failure mode appears after the lists below.)
Rationale for Unique Projections in Multi-Head Attention
Attention Head Specialization
Learn After
In a neural network component that uses parallel processing 'channels' to analyze input, an input representation with a dimension of 512 is transformed. This transformation is split across 8 parallel channels. For the 'key' transformation, the total dimension across all 8 channels is also 512. What is the shape of the learnable weight matrix used for the 'key' transformation within a single one of these channels?
Debugging a Dimensionality Mismatch
Calculating Weight Matrix Dimensions in a Multi-Head Attention Layer
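For the shared-weights question above, a minimal NumPy sketch (illustrative code, not from the source; the attention helper and dimensions are assumptions) shows the consequence: every head computes exactly the same output, so concatenating the $M$ heads adds no new information and the layer degenerates to a single effective head with redundant copies.

```python
import numpy as np

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (illustrative)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax over the attention scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d, M = 512, 8
d_head = d // M
x = rng.normal(size=(10, d))

# Buggy setup: one shared set of projections reused by all M heads.
W_q = rng.normal(size=(d, d_head))
W_k = rng.normal(size=(d, d_head))
W_v = rng.normal(size=(d, d_head))

heads = [attention(x, W_q, W_k, W_v) for _ in range(M)]

# Every head is identical, so the multi-head structure is wasted.
print(all(np.allclose(heads[0], h) for h in heads[1:]))  # True
```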