Learn Before
Relationship Between Head and Model Dimensions
A transformer model has an overall embedding dimension, let's call it d_model. Inside this model, a multi-head attention layer is configured with a certain number of parallel attention heads, let's call this number τ. Each of these individual heads produces an output vector with its own dimension, d_h. Describe the mathematical relationship between d_model, τ, and d_h. Furthermore, explain why this specific relationship is crucial for integrating the multi-head attention layer's final output back into the model's subsequent layers.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Multi-Head Attention Output Calculation
In a multi-head attention mechanism, the model's overall embedding dimension is 768. If this mechanism is configured with 12 separate, parallel attention heads, what is the dimension of the output vector produced by a single one of these heads?
Relationship Between Head and Model Dimensions
In a multi-head attention mechanism where the overall model dimension is d_model and there are τ parallel attention heads (where τ > 1), the output vector of a single attention head has a dimension of d_h = d_model / τ. Concatenating the τ head outputs therefore reconstructs a vector of dimension d_model, which matches the input dimension expected by the model's subsequent layers (such as the output projection and the feed-forward sublayer).
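The dimension bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration of the relationship d_h = d_model / τ using toy data, not an implementation of attention itself; the variable names follow the card's notation.

```python
# Sketch of the dimension relationship in multi-head attention:
# d_h = d_model / tau, and concatenating tau heads restores d_model.

d_model = 768   # overall model embedding dimension (as in the related card)
tau = 12        # number of parallel attention heads

d_h = d_model // tau            # per-head output dimension
assert d_h * tau == d_model     # d_model must be divisible by tau

# Toy outputs: one length-d_h vector per head, for a single token
head_outputs = [[0.0] * d_h for _ in range(tau)]

# Concatenating the tau head outputs yields a d_model-sized vector,
# which the subsequent layers (output projection, feed-forward) expect
concatenated = [x for head in head_outputs for x in head]
print(d_h, len(concatenated))   # 64 768
```

If d_model were not divisible by τ, the concatenated output could not match the model dimension, which is why frameworks typically require this divisibility when configuring a multi-head attention layer.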