Based on the formula for an individual attention head, `head_j = Att_qkv(Q^[j], K^[j], V^[j])`, what is the most probable cause of the issue described in the case study?

Google

In the general vector-level formulation of multi-head attention (Eq. 11.5.1), the $$i$$-th attention head output $$\mathbf{h}_i$$ (for $$i = 1, \ldots, h$$) is computed by first projecting a query $$\mathbf{q} \in \mathbb{R}^{d_q}$$, a key $$\mathbf{k} \in \mathbb{R}^{d_k}$$, and a value $$\mathbf{v} \in \mathbb{R}^{d_v}$$ through head-specific learnable weight matrices, and then applying an attention pooling function $$f$$:

$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q},\; \mathbf{W}_i^{(k)} \mathbf{k},\; \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v}$$

Here, $$\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q 	imes d_q}$$, $$\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k 	imes d_k}$$, and $$\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v 	imes d_v}$$ are learnable parameter matrices that project the original representations into subspaces of dimensions $$p_q$$, $$p_k$$, and $$p_v$$ respectively. The function $$f$$ denotes the attention pooling operation, such as additive attention or scaled dot-product attention.

Individual Attention Head Computation (General Vector Form)

In a causal multi-head attention mechanism, the output for a single head `j` at a specific token position `i` is computed using the standard Query-Key-Value (QKV) attention function. This calculation is restricted to the current and preceding tokens to maintain the autoregressive property. The formula is: $$\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\leq i}^{[j]}, \mathbf{V}_{\leq i}^{[j]})$$ Here, $\mathbf{q}_i^{[j]}$ is the query vector for the i-th token projected for head `j`, while $\mathbf{K}_{\leq i}^{[j]}$ and $\mathbf{V}_{\leq i}^{[j]}$ are the key and value matrices for head `j`, containing information from tokens 0 up to `i`.

Causal Attention Output for a Single Head and Token

In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?

Debugging an Attention Head

In a multi-head attention mechanism, the output of each individual attention head, denoted as $\text{head}_j$, is a vector. This vector belongs to a real-valued vector space of dimension $d_h$, which is represented by the notation: $$\text{head}_j \in \mathbb{R}^{d_h}$$

Dimensionality of an Attention Head Output

You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.

During text generation, the output of the $$j$$-th attention head at step $$i$$ is computed by applying the Query-Key-Value (QKV) attention function to its specific feature sub-space. This operation utilizes the current token's query vector, $$\mathbf{q}_{i}^{[j]}$$, along with the cached key and value matrices for all tokens up to step $$i$$, denoted as $$\mathbf{K}_{\le i}^{[j]}$$ and $$\mathbf{V}_{\le i}^{[j]}$$. By projecting these representations onto the $$j$$-th sub-space, the model can be interpreted as performing attention on a group of independent feature sub-spaces in parallel. The calculation is formalized as: $$\mathrm{head}_j = \mathrm{Att}_{\mathrm{qkv}}(\mathbf{q}_{i}^{[j]}, \mathbf{K}_{\le i}^{[j]}, \mathbf{V}_{\le i}^{[j]})$$.

Autoregressive Individual Attention Head Computation

In a more general vector-level formulation of multi-head attention (Eq. 11.5.2), the final layer output is obtained by stacking the $$h$$ individual head outputs $$\mathbf{h}_1, \ldots, \mathbf{h}_h$$—each lying in $$\mathbb{R}^{p_v}$$—into a single concatenated vector of dimensionality $$hp_v$$, and then multiplying by a learnable output projection matrix $$\mathbf{W}_o \in \mathbb{R}^{p_o 	imes hp_v}$$:

$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \ \vdots \ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}$$

Unlike the matrix-level formulation that fixes the output projection to $$\mathbb{R}^{d 	imes d}$$, this parameterization allows the output dimensionality $$p_o$$ to differ from both the input dimensionality and the per-head value dimensionality $$p_v$$, providing additional architectural flexibility.

Learn Before

Related