Learn Before
Autoregressive Individual Attention Head Computation
During text generation, the output of the $h$-th attention head at step $i$ is computed by applying the Query-Key-Value (QKV) attention function to its specific feature sub-space. This operation utilizes the current token's query vector, $\mathbf{q}_i$, along with the cached key and value matrices for all tokens up to step $i$, denoted as $\mathbf{K}_{\le i}$ and $\mathbf{V}_{\le i}$. By projecting these representations onto the $h$-th sub-space via the head's projection matrices $\mathbf{W}_h^{Q}$, $\mathbf{W}_h^{K}$, and $\mathbf{W}_h^{V}$, the model can be interpreted as performing attention on a group of independent feature sub-spaces in parallel. The calculation is formalized as: $\mathrm{head}_h = \mathrm{Att}_{\mathrm{QKV}}\!\left(\mathbf{q}_i \mathbf{W}_h^{Q},\; \mathbf{K}_{\le i} \mathbf{W}_h^{K},\; \mathbf{V}_{\le i} \mathbf{W}_h^{V}\right)$.
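To make the per-head computation concrete, here is a minimal NumPy sketch of one decoding step for a single head. All names, shapes, and projection matrices are illustrative assumptions, not a reference implementation; in practice, the projected keys and values would be appended to the cache incrementally rather than recomputed from all hidden states at each step, as done here for clarity.

```python
import numpy as np

def qkv_attention(q, K, V):
    """Scaled dot-product attention for a single query vector.

    q: (d_k,)   query for the current step
    K: (t, d_k) keys for all positions up to the current step
    V: (t, d_v) values for all positions up to the current step
    """
    scores = K @ q / np.sqrt(K.shape[-1])   # (t,) similarity to each past position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over past positions
    return weights @ V                      # (d_v,) weighted sum of values

# One decoding step for head h (hypothetical, small shapes).
d_model, d_k, d_v, t = 16, 4, 4, 5
rng = np.random.default_rng(0)

Wq = rng.normal(size=(d_model, d_k))    # per-head projection W_h^Q (assumed)
Wk = rng.normal(size=(d_model, d_k))    # per-head projection W_h^K (assumed)
Wv = rng.normal(size=(d_model, d_v))    # per-head projection W_h^V (assumed)

x_i = rng.normal(size=d_model)          # current token's hidden state at step i
X_cache = rng.normal(size=(t, d_model)) # hidden states of all steps <= i

q_i = x_i @ Wq          # current query, projected onto head h's sub-space
K_le_i = X_cache @ Wk   # K_{<=i}: cached keys (recomputed here for clarity)
V_le_i = X_cache @ Wv   # V_{<=i}: cached values

head_h = qkv_attention(q_i, K_le_i, V_le_i)
print(head_h.shape)     # (d_v,): this head's slice of the attention output
```

Running this for each head $h$ in parallel and concatenating the resulting `head_h` vectors reproduces the multi-head view: each head attends within its own low-dimensional sub-space, independently of the others.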
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Multi-Head Attention Output Calculation
Causal Attention Output for a Single Head and Token
In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?
Debugging an Attention Head
Dimensionality of an Attention Head Output
You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.