Formula

Shape of Key Weight Matrix per Head

In a multi-head attention mechanism, the key weight matrix for an individual attention head, denoted $k_h^k$, has shape $d \times \frac{d_h}{M}$. Here $d$ is the dimension of the input representation, $d_h$ is the total dimension of the key projection across all heads, and $M$ is the number of attention heads. Each head therefore projects the input into a $\frac{d_h}{M}$-dimensional key subspace, and stacking the $M$ heads recovers the full $d_h$-dimensional key projection.
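A minimal NumPy sketch of these shapes, using assumed example values ($d = 512$, $d_h = 512$, $M = 8$), shows each per-head key weight matrix with shape $d \times \frac{d_h}{M}$:

```python
import numpy as np

d = 512    # input representation dimension (assumed example value)
d_h = 512  # total key projection dimension across all heads (assumed)
M = 8      # number of attention heads (assumed)

rng = np.random.default_rng(0)

# One key weight matrix per head, each of shape d x (d_h / M)
key_weights = [rng.standard_normal((d, d_h // M)) for _ in range(M)]

x = rng.standard_normal((10, d))  # a sequence of 10 input vectors

# Project the inputs into each head's key subspace
keys_per_head = [x @ W for W in key_weights]

print(key_weights[0].shape)    # (512, 64) -> d x (d_h / M)
print(keys_per_head[0].shape)  # (10, 64)  -> one key per token, per head
```

Concatenating the per-head keys along the last axis yields a `(10, 512)` array, i.e. the full $d_h$-dimensional key projection.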


Updated 2025-10-08


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences