Learn Before
Dimensionality of an Attention Head Output
In a multi-head attention mechanism, the output of each individual attention head is a vector. This vector belongs to a real-valued vector space of dimension d_model/τ, where d_model is the model's overall embedding dimension and τ is the number of parallel heads. This space is represented by the notation:

ℝ^(d_model/τ)
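The per-head dimensionality stated above can be checked directly in code. The following is a minimal sketch of a single attention head, assuming PyTorch and illustrative sizes (d_model = 512 and 8 heads, so τ = 8); these specific values are not from the card itself. The head's output dimension is d_model divided by the number of heads, not d_model.

import torch

d_model, num_heads = 512, 8        # assumed example sizes; num_heads plays the role of τ
d_head = d_model // num_heads      # per-head output dimension: 512 / 8 = 64

x = torch.randn(1, 10, d_model)    # (batch, sequence length, model dimension)

W_q = torch.randn(d_model, d_head) # this head's Query projection
W_k = torch.randn(d_model, d_head) # this head's Key projection
W_v = torch.randn(d_model, d_head) # this head's Value projection

q, k, v = x @ W_q, x @ W_k, x @ W_v                                      # each (1, 10, d_head)
scores = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)  # attention weights
head_output = scores @ v                                                 # (1, 10, d_head)

print(head_output.shape[-1])       # 64, i.e. d_model / num_heads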
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Multi-Head Attention Output Calculation
Causal Attention Output for a Single Head and Token
In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?
Debugging an Attention Head
Dimensionality of an Attention Head Output
You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.
Autoregressive Individual Attention Head Computation
Learn After
Multi-Head Attention Output Calculation
In a multi-head attention mechanism, the model's overall embedding dimension is 768. If this mechanism is configured with 12 separate, parallel attention heads, what is the dimension of the output vector produced by a single one of these heads?
Relationship Between Head and Model Dimensions
In a multi-head attention mechanism where the overall model dimension is d_model and there are τ parallel attention heads (where τ > 1), the output vector of a single attention head has a dimension of d_model/τ.
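As a quick check of this relationship using the figures from the question above: a model dimension of 768 split across τ = 12 heads gives

d_model / τ = 768 / 12 = 64

so each individual head produces a 64-dimensional output vector.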