Rationale for Unique Projections in Multi-Head Attention
In the context of a multi-head attention mechanism, explain the primary reason for using distinct, learnable weight matrices to project the input representation into separate Query, Key, and Value sets for each individual attention head.
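For reference, below is a minimal NumPy sketch of the mechanism the question refers to: because every head applies its own learnable Q/K/V projection matrices, each head computes attention over a different learned subspace of the same input. All dimensions, variable names, and the use of random stand-in weights are illustrative assumptions, not taken from any particular implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))          # shared input representation

heads = []
for h in range(n_heads):
    # Distinct, independently learnable matrices per head
    # (random stand-ins here; trained jointly in practice).
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # per-head projections
    scores = softmax(Q @ K.T / np.sqrt(d_head))  # scaled dot-product attention
    heads.append(scores @ V)                     # each head's own weighted mixture

output = np.concatenate(heads, axis=-1)          # (seq_len, d_model), before the output projection
print(output.shape)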
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Individual Attention Head Formula
Shape of Key Weight Matrix per Head
Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism with 'M' heads, an engineer makes an implementation error. Instead of creating a unique set of learnable weight matrices for the query, key, and value projections for each of the 'M' heads, the same single set of query, key, and value weight matrices is shared across all heads. What is the primary consequence of this error on the model's functionality?
Attention Head Specialization