Multi-Head Self-Attention Function
The multi-head self-attention function operates on an input representation matrix. Rather than using a single set of attention parameters, the mechanism employs several parallel 'attention heads', each with its own learnable weight matrices for the Query, Key, and Value projections. Scaled dot-product attention is computed independently within each head; the per-head outputs are then concatenated and passed through a final linear transformation to produce the layer's output. This multi-headed design lets the model jointly attend to information from different representational subspaces at different positions.
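To make these steps concrete, here is a minimal NumPy sketch of the function described above. All names (X, W_q, W_k, W_v, W_o, num_heads) are illustrative assumptions rather than the course's notation, and the per-head projections are implemented as one fused (d_model, d_model) matrix per role that is reshaped into heads, which is equivalent to giving each head its own smaller weight matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model) input representation matrix.
    W_q, W_k, W_v, W_o: (d_model, d_model) learnable projections.
    Returns a (seq_len, d_model) output matrix."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # 1. Project the input into Query, Key, and Value representations.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # 2. Split each projection into parallel heads:
    #    (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # 3. Scaled dot-product attention, computed independently in each head.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ Vh                                  # (heads, seq, d_head)

    # 4. Concatenate the heads and apply the final output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Tiny usage example with random weights: output shape matches the input.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (5, 8)
```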
Tags
Data Science
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Attention Output as a Weighted Sum of Values
Value Matrix (V) in Attention
Multi-Head Self-Attention Function
Scaled Dot-Product Attention
Purpose and Structure of the Feed-Forward Network (FFN) in Transformers
A standard processing block in a Transformer model consists of two main sub-layers applied in sequence. The first sub-layer's primary role is to relate different positions of the input sequence to compute a new representation for each position. The second sub-layer then applies an identical non-linear transformation to each position's representation independently. How does the core computational function, denoted as F(·), implemented within each of these sub-layers, differ?
Identifying Core Functions in a Transformer Block
A standard processing block in a certain neural network architecture consists of two main sub-layers. Each sub-layer's computation can be described as applying a core function, F(·), within a structure that also includes a residual connection and layer normalization. Match each sub-layer type with the correct description of its core computational function, F(·).
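For readers working through the two questions above, the following is a minimal NumPy sketch (my own assumptions, not the course's code) contrasting the two core functions F(·): self-attention in the first sub-layer mixes information across positions, while the position-wise feed-forward network in the second applies the same non-linear map to each position independently. Both are wrapped in a residual connection and layer normalization, shown here in post-norm form and without learnable norm parameters for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_sublayer(X, W_q, W_k, W_v):
    # F(.) of sub-layer 1: single-head scaled dot-product self-attention,
    # which relates different positions of the sequence to one another.
    d = W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def ffn_sublayer(X, W1, b1, W2, b2):
    # F(.) of sub-layer 2: a position-wise feed-forward network; the same
    # non-linear transformation is applied to each position independently.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def transformer_block(X, attn_weights, ffn_weights):
    # Post-norm wiring: output = LayerNorm(x + F(x)) for each sub-layer.
    X = layer_norm(X + attention_sublayer(X, *attn_weights))
    X = layer_norm(X + ffn_sublayer(X, *ffn_weights))
    return X

# Tiny shape check with random weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
X = rng.normal(size=(seq_len, d_model))
attn_w = tuple(rng.normal(size=(d_model, d_model)) for _ in range(3))
ffn_w = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
         rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(transformer_block(X, attn_w, ffn_w).shape)  # (4, 8)
```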
Learn After
Self-Attention layer understanding - Step 5 - Adding the time
Query, Key, and Value Projections in Multi-Head Attention
Scalar per Head in Multi-Head Attention
In a multi-head self-attention mechanism, what is the primary advantage of using multiple parallel attention 'heads'—each with its own unique set of learnable weight matrices—compared to using a single attention mechanism with the same total dimensionality?
Analysis of a Modified Attention Mechanism
Arrange the following computational steps of a multi-head self-attention layer in the correct chronological order, starting from the point where the layer receives its input representation matrix.
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
You are reviewing a teammate’s implementation of a...
You’re debugging a Transformer block in an interna...
You’re implementing a single Transformer block in ...
Number of Attention Heads
Reducing KV Cache Complexity via Head Sharing