Formula

Shape of Key Weight Sub-Matrix per Head

In a multi-head attention mechanism, the key weight sub-matrix for an individual attention head, denoted $W_h^k$, has shape $d \times \frac{d}{M}$. This formula applies specifically when the total dimension of the key projection across all heads equals the input representation dimension $d$. In this context, $M$ represents the number of attention heads.
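A minimal NumPy sketch of this shape calculation, assuming illustrative values $d = 512$ and $M = 8$ (not taken from the source): the full key projection weight $W^k$ of shape $d \times d$ is split column-wise into $M$ per-head sub-matrices, each of shape $d \times \frac{d}{M}$.

```python
import numpy as np

# Hypothetical sizes: model dimension d and number of heads M.
d, M = 512, 8
d_k = d // M  # per-head key dimension, d / M

# Full key projection weight W^k has shape d x d when the total
# key dimension across all heads equals the model dimension d.
W_k = np.random.randn(d, d)

# Split W^k column-wise into M per-head sub-matrices W_h^k,
# each of shape d x (d / M).
W_k_heads = np.split(W_k, M, axis=1)

print(W_k_heads[0].shape)  # shape of one head's key sub-matrix: (512, 64)
```

Splitting along the column axis reflects the standard view that the concatenation of all heads' key projections reconstitutes the full $d \times d$ matrix $W^k$.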

Updated 2025-10-08

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences