A transformer model calculates the output for a single attention head j at token position i using the formula: $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where g(j) maps the query head j to a specific group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
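To make the formula concrete, here is a minimal NumPy sketch of one grouped-query attention head with causal masking. All names, tensor sizes, and the grouping rule g(j) = j // (num_heads // num_groups) are illustrative assumptions, not from the source. The key point the sketch shows: because K and V are indexed by g(j) rather than j, only num_groups distinct key/value tensors exist instead of num_heads, so query heads within a group share the same keys and values (and hence the same KV cache entries).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_head(q_i, K_group, V_group):
    """Att_qkv for one head: attention of query q_i over the group's
    keys/values for positions <= i. Causality is enforced by passing
    in only the rows 0..i of K and V."""
    d = q_i.shape[-1]
    scores = K_group @ q_i / np.sqrt(d)   # shape (i+1,)
    weights = softmax(scores)             # attention over positions <= i
    return weights @ V_group              # shape (d_head,)

# Toy configuration (illustrative, not from the source):
num_heads, num_groups, d_head, seq_len = 8, 2, 4, 6
heads_per_group = num_heads // num_groups

def g(j):
    """Assumed grouping rule: map query head j to its key/value group."""
    return j // heads_per_group

rng = np.random.default_rng(0)
# Queries are per HEAD; keys/values are per GROUP (the point of GQA):
Q = rng.normal(size=(num_heads, seq_len, d_head))
K = rng.normal(size=(num_groups, seq_len, d_head))
V = rng.normal(size=(num_groups, seq_len, d_head))

i, j = 4, 5  # token position i, query head j
out = gqa_head(Q[j, i], K[g(j), :i + 1], V[g(j), :i + 1])
print(out.shape)  # (4,)
```

Note how heads j = 4..7 all read K[1] and V[1] in this sketch: during autoregressive decoding, a model only has to cache num_groups key/value sequences per layer, which is the usual motivation for the group-indexed form.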
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Applying Grouped-Query Attention with Causal Masking
Deconstructing the GQA Formula with Causal Masking