Multi-Head Attention Output Formula (General Vector Form)
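In the more general vector-level formulation of multi-head attention (Eq. 11.5.2), the final layer output is obtained by stacking the $h$ individual head outputs $\mathbf{h}_1, \ldots, \mathbf{h}_h$, each lying in $\mathbb{R}^{p_v}$, into a single concatenated vector of dimensionality $h p_v$, and then multiplying by a learnable output projection matrix $\mathbf{W}_o \in \mathbb{R}^{p_o \times h p_v}$:
$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}$$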
Unlike the matrix-level formulation, which fixes the output projection to a square $d_{\text{model}} \times d_{\text{model}}$ matrix, this parameterization allows the output dimensionality $p_o$ to differ from both the input dimensionality $d_{\text{model}}$ and the per-head value dimensionality $p_v$, providing additional architectural flexibility.
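
The concatenate-then-project step can be sketched in plain Python to make the shapes concrete. This is a minimal illustration, not a reference implementation: the helper names (`concat_heads`, `project`, `multi_head_output`) and the sizes (`h = 8` heads, per-head value size `p_v = 64`, output size `p_o = 100`) are assumptions chosen for the example, and the head outputs are random placeholders rather than real attention results.

```python
import random


def concat_heads(heads):
    """Stack h head outputs (each of length p_v) into one vector of length h * p_v."""
    return [x for head in heads for x in head]


def project(W_o, v):
    """Multiply a matrix stored as a list of rows (p_o rows, len(v) columns) by v."""
    if any(len(row) != len(v) for row in W_o):
        raise ValueError("each row of W_o must have length h * p_v")
    return [sum(w * x for w, x in zip(row, v)) for row in W_o]


def multi_head_output(heads, W_o):
    """Concatenate the head outputs, then apply the output projection W_o."""
    return project(W_o, concat_heads(heads))


# Hypothetical sizes for illustration only (not taken from the source).
h, p_v, p_o = 8, 64, 100
rng = random.Random(0)

# Placeholder head outputs and a randomly initialised projection matrix.
heads = [[rng.gauss(0, 1) for _ in range(p_v)] for _ in range(h)]
W_o = [[rng.gauss(0, 0.02) for _ in range(h * p_v)] for _ in range(p_o)]

concat = concat_heads(heads)
out = multi_head_output(heads, W_o)
print(len(concat), len(out))  # 512 100
```

Because `p_o` is chosen independently of `h * p_v`, the output vector here has 100 entries even though the concatenated vector has 512. Setting `p_o = h * p_v` would give the square-projection special case.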

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
In a multi-head attention mechanism, the model's overall embedding dimension is 768. If this mechanism is configured with 12 separate, parallel attention heads, what is the dimension of the output vector produced by a single one of these heads?
Relationship Between Head and Model Dimensions
In a multi-head attention mechanism where the overall model dimension is d_model and there are τ parallel attention heads (where τ > 1), the output vector of a single attention head has a dimension of d_model.
Causal Attention Output for a Single Head and Token
In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?
Debugging an Attention Head
Dimensionality of an Attention Head Output
You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.
Autoregressive Individual Attention Head Computation
Learn After
A multi-head attention layer in a model has 8 parallel attention heads. For a single input token, the output from each of these 8 heads is a vector with 64 dimensions. The mechanism's next step is to concatenate these 8 vectors into a single, larger vector. This larger vector is then multiplied by a final weight matrix to produce the layer's final output vector for that token. What is the dimensionality of the single vector that results from the concatenation step, before the final matrix multiplication is applied?
After each parallel attention head has computed its individual output vector, what is the correct sequence of operations to produce the final output of the multi-head attention layer?
Determining Weight Matrix Dimensions in Multi-Head Attention