Shape of Key Weight Sub-Matrix per Head
In a multi-head attention mechanism, the key weight sub-matrix for an individual attention head, denoted as $W^K_i$, has a shape of $d \times d/M$. This formula applies specifically when the total dimension of the key projection across all heads equals the input representation dimension, $d$. In this context, $M$ represents the number of attention heads.
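A minimal NumPy sketch of this shape rule (the names d, M, d_k and the value 512 are illustrative assumptions, not part of the card):

```python
import numpy as np

d = 512       # input representation dimension (illustrative)
M = 8         # number of attention heads (illustrative)
d_k = d // M  # per-head key dimension: 512 / 8 = 64

# One key weight sub-matrix per head, each of shape (d, d/M).
W_k_heads = [np.random.randn(d, d_k) for _ in range(M)]

x = np.random.randn(d)             # a single input vector
keys = [x @ W for W in W_k_heads]  # one (d/M,)-dimensional key per head

# Concatenating the per-head keys recovers the full key dimension d.
combined = np.concatenate(keys)
assert combined.shape == (d,)
```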
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Individual Attention Head Formula
Shape of Key Weight Matrix per Head
In a multi-head attention mechanism with 'M' heads, an engineer makes an implementation error: instead of creating a unique set of learnable query, key, and value weight matrices for each of the 'M' heads, a single shared set is used across all heads. What is the primary consequence of this error on the model's functionality? (A toy demonstration follows this list.)
Rationale for Unique Projections in Multi-Head Attention
Attention Head Specialization
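As a hedged illustration of the shared-weights question above (the sizes, names, and softmax helper are invented for the sketch; the card does not supply an official answer), sharing one query/key/value projection triple makes every head compute an identical output, so the M heads collapse into one head repeated M times:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, M, n = 512, 8, 4          # model dim, heads, sequence length (illustrative)
d_h = d // M

X = rng.normal(size=(n, d))  # a toy input sequence

# The implementation error: one (d, d/M) projection triple reused by all heads.
Wq = rng.normal(size=(d, d_h))
Wk = rng.normal(size=(d, d_h))
Wv = rng.normal(size=(d, d_h))

def head_output(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_h))  # scaled dot-product attention
    return A @ V

outputs = [head_output(X, Wq, Wk, Wv) for _ in range(M)]

# Every head produces exactly the same matrix: no head specialization remains.
for out in outputs[1:]:
    assert np.allclose(out, outputs[0])
```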
Learn After
In a neural network component, an input representation of dimension 512 is processed by 8 parallel 'heads'. For each head, a 'key' vector is produced by multiplying the input representation by a head-specific weight matrix. The key vectors from all heads are concatenated, resulting in a final combined dimension of 512. What is the shape of the weight matrix used to produce the key vector for a single head? (A worked computation follows this list.)
Determining the Number of Attention Heads
Debugging a Multi-Head Attention Layer
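A worked reading of the 512-dimension practice question above, reusing the symbols from the definition at the top of this card (the notation $W^K_i$ is our own choice):

```latex
% d = 512 (input dimension), M = 8 (heads), concatenated key dimension = 512
\[
d_k = \frac{d}{M} = \frac{512}{8} = 64
\qquad\Rightarrow\qquad
W^K_i \in \mathbb{R}^{d \times d/M} = \mathbb{R}^{512 \times 64}
\]
```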