
Query, Key, and Value Projections in Multi-Head Attention

In a multi-head attention mechanism, the queries, keys, and values for the $j$-th attention head are obtained by projecting the input representation $\mathbf{H}$ into different subspaces via linear transformations. These transformations use unique learnable parameter matrices for each head. The projections are defined as follows:

$$\mathbf{Q}^{[j]} = \mathbf{H} \mathbf{W}_j^{q}, \qquad \mathbf{K}^{[j]} = \mathbf{H} \mathbf{W}_j^{k}, \qquad \mathbf{V}^{[j]} = \mathbf{H} \mathbf{W}_j^{v}$$

Here, $\mathbf{W}_j^{q}$, $\mathbf{W}_j^{k}$, and $\mathbf{W}_j^{v} \in \mathbb{R}^{d \times \frac{d}{\tau}}$ denote the parameter matrices of the transformations for the $j$-th head, where $d$ is the model dimension and $\tau$ is the number of heads, so each head operates in a subspace of dimension $\frac{d}{\tau}$.
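Below is a minimal sketch of these projections for a single head, assuming PyTorch; the concrete dimensions (`d = 512`, `tau = 8`, `seq_len = 10`) and variable names are illustrative, not from the source.

```python
import torch

d = 512            # model (hidden) dimension d (illustrative)
tau = 8            # number of attention heads tau (illustrative)
d_head = d // tau  # per-head subspace dimension d/tau

seq_len = 10
H = torch.randn(seq_len, d)  # input representation H

# Unique learnable parameter matrices for the j-th head,
# each of shape (d, d/tau)
W_q = torch.randn(d, d_head)  # W_j^q
W_k = torch.randn(d, d_head)  # W_j^k
W_v = torch.randn(d, d_head)  # W_j^v

# Linear projections into the j-th head's subspace
Q = H @ W_q  # Q^[j], shape (seq_len, d/tau)
K = H @ W_k  # K^[j], shape (seq_len, d/tau)
V = H @ W_v  # V^[j], shape (seq_len, d/tau)

print(Q.shape, K.shape, V.shape)  # each torch.Size([10, 64])
```

In practice, frameworks often fuse the per-head matrices into one $d \times d$ projection and reshape the result into $\tau$ heads; the per-head formulation above matches the notation of the formulas directly.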


