An engineer is analyzing the computational architecture of a large language model. They observe the following formula being used to calculate the output for an individual attention head j at a specific step i:
head_j = Attention(q_i^[j], K_<=i, V_<=i)
Based only on the components of this formula, what is the most accurate conclusion the engineer can draw about the relationship between the different attention heads in this layer?
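The telling detail in this formula is the placement of the head index: the query q_i^[j] is superscripted with [j], while K_<=i and V_<=i carry no head index at all. Read literally, every head computes its own query but attends over one shared set of keys and values, which is the multi-query attention pattern. A minimal NumPy sketch of that reading (head count, dimensions, and random inputs are illustrative assumptions, not details from the engineer's system):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    # Single-query attention: q has shape (d,), K and V have shape (i, d).
    scores = K @ q / np.sqrt(q.shape[-1])  # similarity of q to each key, shape (i,)
    weights = softmax(scores)              # attention distribution over steps <= i
    return weights @ V                     # weighted sum of values, shape (d,)

# Illustrative sizes (assumptions): 4 heads, head dimension 8, current step i = 5.
n_heads, d, i = 4, 8, 5
rng = np.random.default_rng(0)

K_le_i = rng.standard_normal((i, d))  # one Key matrix shared by all heads
V_le_i = rng.standard_normal((i, d))  # one Value matrix shared by all heads

# Each head j contributes only its own query q_i^[j]; K and V are reused as-is.
head_outputs = [attention(rng.standard_normal(d), K_le_i, V_le_i) for j in range(n_heads)]
print(len(head_outputs), head_outputs[0].shape)  # 4 heads, each output of shape (8,)

Note that nothing inside attention() depends on j except the query passed in, mirroring how the formula gives each head a distinct q_i^[j] but an unindexed K_<=i and V_<=i.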
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of Attention Head Architectures
In a Multi-Query Attention (MQA) layer, all attention heads share the same Key and Value matrices. The formula for the output of a single, specific head j at step i is given as: head_j = Att_qkv(______, K_<=i, V_<=i). What component correctly fills the blank to represent the unique input for this specific head?
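For contrast, a standard multi-head attention layer would index every tensor by head, not just the query; a sketch of the two forms in the same plain notation used above:

head_j = Attention(q_i^[j], K_<=i^[j], V_<=i^[j])   (multi-head: per-head Keys and Values)
head_j = Att_qkv(q_i^[j], K_<=i, V_<=i)             (multi-query: shared Keys and Values)

Since the shared K_<=i and V_<=i are fixed for the whole layer in MQA, the query q_i^[j] is the only per-head input available to fill the blank.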