In a specific attention mechanism, there are 8 query heads (indexed j=1 to 8) and 2 distinct Key-Value (KV) groups (indexed g=1 to 2). Query heads 1 through 4 are assigned to KV group 1, while query heads 5 through 8 are assigned to KV group 2. The output for a given query head j is calculated based on its own query vector q^[j] and the Key-Value pair from its assigned group, (K^[g(j)], V^[g(j)]). Which Key-Value pair will query head 6 use for its computation?
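The head-to-group assignment described above can be sketched as a small lookup function. This is only an illustration: the name kv_group and its parameters are hypothetical, and the equal, contiguous split of heads into groups is an assumption read off the head ranges given in the question.

```python
def kv_group(j: int, num_heads: int = 8, num_groups: int = 2) -> int:
    """Return the 1-indexed KV group g(j) for the 1-indexed query head j.

    Assumption: heads are divided into equal, contiguous blocks
    (heads 1-4 -> group 1, heads 5-8 -> group 2 for the defaults).
    """
    if not 1 <= j <= num_heads:
        raise ValueError(f"head index {j} is outside 1..{num_heads}")
    heads_per_group = num_heads // num_groups
    return (j - 1) // heads_per_group + 1


# Full mapping for the 8-head, 2-group setup in the question.
mapping = {j: kv_group(j) for j in range(1, 9)}
print(mapping)
print("head 6 uses (K^[%d], V^[%d])" % (kv_group(6), kv_group(6)))
```

Under this assumed mapping, every head in the range 5 through 8 resolves to the same group index, so those heads read the same keys and values while keeping their own query vectors.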
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a grouped-query attention system with 12 query heads (indexed j=1 to 12), the function g(j) maps a query head j to its corresponding key-value group. This mapping is defined by the formula g(j) = floor((j-1) / 3) + 1. Based on this, which of the following pairs of query heads will use the same set of Key and Value matrices for their attention computation?

Consider an attention mechanism where the output for a head j is computed by the formula head_j = Att_qkv(q_i^[j], K_<=i^[g(j)], V_<=i^[g(j)]). In this setup, q_i^[j] is a query vector unique to head j, while the function g(j) maps head j to a potentially shared key-value group. Statement: If two distinct query heads, j1 and j2, are mapped to the same key-value group (meaning g(j1) = g(j2)), their final output vectors, head_j1 and head_j2, will necessarily be identical.
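The two related items above can be explored with a short sketch. It assumes that Att_qkv is ordinary scaled dot-product attention for a single query position, which the source does not spell out. The function names group_of and attend and the toy keys, values, and queries are all hypothetical and chosen only for illustration.

```python
import math


def group_of(j: int, heads_per_group: int = 3) -> int:
    """g(j) = floor((j - 1) / 3) + 1 for 1-indexed heads (formula from the text)."""
    return (j - 1) // heads_per_group + 1


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector (assumed form of Att_qkv)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the value vectors, one output component at a time.
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]


# Heads 1 and 2 fall in the same group under the formula, so they share K and V.
assert group_of(1) == group_of(2)
shared_keys = [[1.0, 0.0], [0.0, 1.0]]    # toy K for the shared group
shared_values = [[1.0, 0.0], [0.0, 1.0]]  # toy V for the shared group

q_head1 = [2.0, 0.0]  # each head keeps its own query vector
q_head2 = [0.0, 2.0]

head1 = attend(q_head1, shared_keys, shared_values)
head2 = attend(q_head2, shared_keys, shared_values)
print(head1, head2)
```

In this toy setup the two heads share one key-value group but weight the keys differently because their queries differ, so sharing K and V does not by itself force the outputs to match.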