Learn Before
Calculating Weight Matrix Dimensions in a Multi-Head Attention Layer
A multi-head attention mechanism is configured with 12 parallel heads. It receives input where each element has a dimension of 768. The total dimension of the 'key' projection, combined across all 12 heads, is also 768. What are the dimensions of the weight matrix used to compute the 'key' for a single head? (Format your answer as 'rows x columns', e.g., '100 x 50').
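The arithmetic can be checked with a few lines of Python. This is a minimal sketch assuming the standard convention in which the combined key dimension is split evenly across the heads; the variable names are illustrative, not taken from any particular library.

```python
# Minimal sketch of the dimension arithmetic, assuming the combined
# key dimension is split evenly across the heads (standard convention).
num_heads = 12
d_model = 768        # dimension of each input element
d_key_total = 768    # 'key' dimension combined across all heads

# Each head receives an equal share of the total key dimension.
d_key_per_head = d_key_total // num_heads   # 768 // 12 = 64

# The per-head key projection maps a d_model-dimensional input to a
# d_key_per_head-dimensional key, so in the 'x @ W' convention its
# weight matrix has shape d_model x d_key_per_head.
print(f"{d_model} x {d_key_per_head}")      # -> 768 x 64
```

Note that 'rows x columns' here follows the x @ W convention; some libraries store the projection weight transposed (for example, PyTorch's nn.Linear keeps weights as (out_features, in_features)), so the same matrix can appear as 64 x 768 in code.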
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a neural network component that uses parallel processing 'channels' to analyze input, an input representation with a dimension of 512 is transformed. This transformation is split across 8 parallel channels. For the 'key' transformation, the total dimension across all 8 channels is also 512. What is the shape of the learnable weight matrix used for the 'key' transformation within a single one of these channels?
Debugging a Dimensionality Mismatch