Individual Attention Head Computation (General Vector Form)
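In the general vector-level formulation of multi-head attention (Eq. 11.5.1), the $i$-th attention head output $\mathbf{h}_i$ (for $i = 1, \ldots, h$) is computed by first projecting a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$ through head-specific learnable weight matrices, and then applying an attention pooling function $f$:
$$\mathbf{h}_i = f\left(\mathbf{W}_i^{(q)} \mathbf{q},\ \mathbf{W}_i^{(k)} \mathbf{k},\ \mathbf{W}_i^{(v)} \mathbf{v}\right) \in \mathbb{R}^{p_v}$$
Here, $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameter matrices that project the original representations into subspaces of dimensions $p_q$, $p_k$, and $p_v$ respectively. The function $f$ denotes the attention pooling operation, such as additive attention or scaled dot-product attention.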
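The computation of a single head can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation. It assumes scaled dot-product attention as the pooling function, which requires the projected query and key dimensions to match, and it pools over a small set of n key-value pairs rather than a single pair. All names (attention_head, W_q, W_k, W_v, p_q, p_k, p_v) and the chosen sizes are hypothetical, picked only for this example.

```python
import numpy as np

def scaled_dot_product_pooling(q, K, V):
    """Attention pooling: one projected query against n projected key-value pairs.

    q: (p,), K: (n, p), V: (n, p_v). Returns a vector of shape (p_v,).
    """
    scores = K @ q / np.sqrt(q.shape[-1])    # (n,) similarity of q to each key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # weighted average of the values

def attention_head(q, keys, values, W_q, W_k, W_v):
    """One head: project query, keys, and values, then apply attention pooling."""
    q_proj = W_q @ q         # (p_q,)
    K_proj = keys @ W_k.T    # (n, p_k), each key projected by W_k
    V_proj = values @ W_v.T  # (n, p_v), each value projected by W_v
    return scaled_dot_product_pooling(q_proj, K_proj, V_proj)

# Illustrative sizes (assumptions, not taken from the text).
rng = np.random.default_rng(0)
d_q, d_k, d_v = 8, 8, 8  # original query, key, and value dimensions
p_q = p_k = 4            # projected query/key dimension (must match for dot product)
p_v = 5                  # projected value dimension, also the head output dimension
n = 6                    # number of key-value pairs

q = rng.normal(size=d_q)
keys = rng.normal(size=(n, d_k))
values = rng.normal(size=(n, d_v))

# Head-specific projection matrices; in a trained model these are learned.
W_q = rng.normal(size=(p_q, d_q))
W_k = rng.normal(size=(p_k, d_k))
W_v = rng.normal(size=(p_v, d_v))

h_i = attention_head(q, keys, values, W_q, W_k, W_v)
print(h_i.shape)  # (5,)
```

Each head would hold its own W_q, W_k, and W_v. Calling attention_head with a different set of matrices on the same inputs generally yields a different output, which is what lets heads attend to different aspects of the input.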

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
Shape of Key Weight Matrix per Head
Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism with 'M' heads, an engineer makes an implementation error. Instead of creating a unique set of learnable weight matrices for the query, key, and value projections for each of the 'M' heads, the same single set of query, key, and value weight matrices is shared across all heads. What is the primary consequence of this error on the model's functionality?
Rationale for Unique Projections in Multi-Head Attention
Attention Head Specialization
Parametrization Cost Control in Multi-Head Attention
Learn After
Causal Attention Output for a Single Head and Token
In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?
Debugging an Attention Head
Dimensionality of an Attention Head Output
You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.
Autoregressive Individual Attention Head Computation
Multi-Head Attention Output Formula (General Vector Form)