Learn Before
Debugging a Multi-Head Attention Layer
Analyze the following scenario to identify the fundamental error in the specified weight matrix's dimensions. Explain why it is an error and state what the correct dimensions should be.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a neural network component, an input representation of dimension 512 is processed by 8 parallel 'heads'. For each head, a 'key' vector is produced by multiplying the input representation by a specific weight matrix. The 'key' vectors from all heads are concatenated, resulting in a final combined dimension of 512. What is the shape of the weight matrix used to produce the 'key' vector for a single head? (A dimension check for this question is sketched after this list.)
Determining the Number of Attention Heads
Debugging a Multi-Head Attention Layer
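As a minimal sketch (not part of the original card), the arithmetic behind the related question above can be verified in NumPy: with a model width of 512 split across 8 heads, each head's key dimension must be 512 / 8 = 64, so the per-head key weight matrix has shape 512 × 64, assuming the input-times-matrix convention stated in the question. The variable names below (d_model, d_k, W_K) are illustrative, not taken from the source.

```python
import numpy as np

# Assumed setup from the related question: model width 512, 8 heads.
d_model = 512
h = 8
d_k = d_model // h  # per-head key dimension: 512 / 8 = 64

x = np.random.randn(d_model)  # input representation for one token

# Per-head key projection: each (512, 64) matrix maps the 512-dim
# input to a 64-dim key vector.
W_K = [np.random.randn(d_model, d_k) for _ in range(h)]
keys = [x @ W for W in W_K]  # each key has shape (64,)

# Concatenating the 8 per-head keys restores the combined dimension 512.
combined = np.concatenate(keys)
assert combined.shape == (d_model,)
print([k.shape for k in keys], combined.shape)
```

A matrix of any other shape, such as 512 × 512 per head, would make the concatenated dimension 8 × 512 = 4096 rather than 512, which is the kind of mismatch the debugging card asks you to identify.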