Concept

Multi-Head Self-Attention Function

The multi-head self-attention function operates on an input representation matrix, $\mathbf{H} \in \mathbb{R}^{m \times d}$. Rather than using a single set of attention parameters, this mechanism employs $h$ parallel 'attention heads'. Each head has its own unique set of learnable weight matrices for Query, Key, and Value projections. An attention pooling function $f$, such as additive attention or scaled dot-product attention, is applied independently within each head. The outputs from all heads are then concatenated and projected through a final linear transformation to produce the layer's output. Because each head operates in its own learned subspace, different heads may focus on different parts of the input, enabling the model to jointly attend to information from multiple representational subspaces at different positions. This design allows the mechanism to express more sophisticated functions than a simple weighted average.
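The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it picks scaled dot-product attention as the pooling function $f$, uses small random matrices as stand-ins for learned weights, and sets each head's width to $d / h$. The names `multi_head_self_attention`, `random_matrix`, `W_q`, `W_k`, `W_v`, and `W_o` are hypothetical and chosen only for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Assumption: random Gaussian entries stand in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(H, heads, W_o):
    """H is m x d; heads is a list of (W_q, W_k, W_v), each d x d_head;
    W_o is (h * d_head) x d. Returns an m x d matrix."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        # Each head projects the same input with its own weight matrices.
        Q, K, V = matmul(H, W_q), matmul(H, W_k), matmul(H, W_v)
        scale = math.sqrt(len(Q[0]))
        # Scaled dot-product scores: entry (i, j) compares query i with key j.
        K_T = [list(col) for col in zip(*K)]
        scores = [[s / scale for s in row] for row in matmul(Q, K_T)]
        weights = [softmax(row) for row in scores]
        head_outputs.append(matmul(weights, V))  # m x d_head
    # Concatenate the heads along the feature axis, then apply the final projection.
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(H))]
    return matmul(concat, W_o)

rng = random.Random(0)
m, d, h = 4, 8, 2          # sequence length, model width, number of heads
d_head = d // h            # assumption: heads split the model width evenly
H = random_matrix(m, d, rng)
heads = [tuple(random_matrix(d, d_head, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(h * d_head, d, rng)
output = multi_head_self_attention(H, heads, W_o)
print(len(output), len(output[0]))
```

The output has the same shape as the input ($m \times d$), so a layer like this can be stacked. Swapping in a different pooling function $f$, such as additive attention, would only change how `scores` is computed inside the loop.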


Updated 2026-05-14

Tags

Data Science

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.2 Generative Models - Foundations of Large Language Models

D2L

Dive into Deep Learning @ D2L
