Concept

Multi-Head Self-Attention Function

The multi-head self-attention function operates on an input representation matrix $\mathbf{H} \in \mathbb{R}^{m \times d}$. Rather than using a single set of attention parameters, this mechanism employs multiple parallel 'attention heads'. Each head has its own learnable weight matrices for the Query, Key, and Value projections. A scaled dot-product attention operation is performed independently within each head. The outputs from all heads are then concatenated and projected through a final linear transformation to produce the layer's output. This multi-headed approach enables the model to jointly attend to information from different representational subspaces at different positions.
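A minimal PyTorch sketch of this computation is shown below, assuming an input matrix of shape $(m, d)$ whose rows are token representations. The class name, the head count, and the parameter names (`W_q`, `W_k`, `W_v`, `W_o`, `num_heads`, `d_k`) are illustrative choices, not taken from the source text.

```python
# Minimal sketch of multi-head self-attention (illustrative names, not from the source).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, num_heads: int):
        super().__init__()
        assert d % num_heads == 0, "model dimension must divide evenly across heads"
        self.num_heads = num_heads
        self.d_k = d // num_heads              # per-head subspace dimension
        # One learnable projection per role; each is split across heads below.
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)  # final output projection

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        m, d = H.shape
        # Project, then reshape to (num_heads, m, d_k): each head gets its own subspace.
        def split(x):
            return x.view(m, self.num_heads, self.d_k).transpose(0, 1)
        Q, K, V = split(self.W_q(H)), split(self.W_k(H)), split(self.W_v(H))
        # Scaled dot-product attention, computed independently within each head.
        scores = Q @ K.transpose(-2, -1) / (self.d_k ** 0.5)   # (num_heads, m, m)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ V                                     # (num_heads, m, d_k)
        # Concatenate the heads and apply the final linear transformation.
        concat = heads.transpose(0, 1).reshape(m, d)
        return self.W_o(concat)

# Usage: 12 tokens, model dimension 64, 8 heads.
H = torch.randn(12, 64)
out = MultiHeadSelfAttention(d=64, num_heads=8)(H)
print(out.shape)  # torch.Size([12, 64])
```

Splitting a single $d$-dimensional projection into `num_heads` slices of size $d_k = d / \text{num\_heads}$ is the standard way to realize "each head has its own weight matrices" without increasing the total parameter count.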


Updated 2026-04-23

Tags

Data Science

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.2 Generative Models - Foundations of Large Language Models

Learn After