Attention Head Specialization
Based on the mechanism that generates the query, key, and value matrices, explain how the different heads in multi-head attention can learn to attend to such different types of information, even though they all start from the exact same input representation.
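As a starting point for thinking about this question, the sketch below compares two setups on the same input: one where every head has its own query and key projection matrices, and one where a single pair is shared by all heads. It is a minimal illustration rather than a reference implementation. The names (`attention_weights`, `random_matrix`, `d_head`) and the sizes are assumptions made for the example, the weight matrices are random stand-ins for learned parameters, and the value projection is left out because only the attention pattern is compared. Pure standard-library Python is used to keep the sketch dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Random stand-in for a learned weight matrix (assumption: not trained)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention_weights(x, w_q, w_k):
    """Scaled dot-product attention weights for one head: softmax(Q K^T / sqrt(d_k))."""
    q = matmul(x, w_q)                      # (seq_len, d_head)
    k = matmul(x, w_k)                      # (seq_len, d_head)
    d_k = len(w_k[0])
    k_t = [list(col) for col in zip(*k)]    # transpose of K
    scores = matmul(q, k_t)                 # (seq_len, seq_len)
    return [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]


rng = random.Random(0)
seq_len, d_model, d_head, num_heads = 4, 8, 2, 3  # illustrative sizes only

# Every head receives this same input representation.
x = random_matrix(rng, seq_len, d_model)

# Case 1: each head has its own query and key projection matrices.
per_head = [
    (random_matrix(rng, d_model, d_head), random_matrix(rng, d_model, d_head))
    for _ in range(num_heads)
]
distinct = [attention_weights(x, w_q, w_k) for w_q, w_k in per_head]

# Case 2: one pair of projection matrices is shared by all heads.
shared_q, shared_k = per_head[0]
shared = [attention_weights(x, shared_q, shared_k) for _ in range(num_heads)]

# Different projections produce different attention patterns from the same input;
# shared projections make every head compute exactly the same pattern.
heads_differ = distinct[0] != distinct[1]
heads_identical = all(h == shared[0] for h in shared)
```

In this sketch the only thing that separates one head from another is its projection matrices, so with separate matrices the heads can place their attention differently, while with shared matrices they are redundant copies of one another.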
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Individual Attention Head Formula
Shape of Key Weight Matrix per Head
Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism with 'M' heads, an engineer makes an implementation error. Instead of creating a unique set of learnable weight matrices for the query, key, and value projections for each of the 'M' heads, the same single set of query, key, and value weight matrices is shared across all heads. What is the primary consequence of this error on the model's functionality?
Rationale for Unique Projections in Multi-Head Attention