Learn Before
Causal Attention Output for a Single Head and Token
In a causal multi-head attention mechanism, the output for a single head $j$ at a specific token position $i$ is computed using the standard Query-Key-Value (QKV) attention function. This calculation is restricted to the current and preceding tokens to maintain the autoregressive property. The formula is: $$\mathrm{head}_j^{(i)} = \mathrm{Softmax}\!\left(\frac{\mathbf{q}_j^{(i)}\,\bigl(\mathbf{K}_j^{(\le i)}\bigr)^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_j^{(\le i)}$$ Here, $\mathbf{q}_j^{(i)}$ is the query vector for the $i$-th token projected for head $j$, while $\mathbf{K}_j^{(\le i)}$ and $\mathbf{V}_j^{(\le i)}$ are the key and value matrices for head $j$, containing information from tokens $0$ up to $i$; $d_k$ is the key dimension used for scaling.
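The computation above can be sketched numerically. This is a minimal illustration, not a library implementation: the function name `causal_head_output` and the toy dimensions are assumptions, and the key point is that only rows 0 through i of the key and value matrices participate.

```python
import numpy as np

def causal_head_output(q_i, K, V, i):
    """Single-head attention output for token position i,
    attending only to tokens 0..i (causal restriction)."""
    d_k = K.shape[1]
    K_vis = K[: i + 1]                       # keys for tokens 0..i only
    V_vis = V[: i + 1]                       # values for tokens 0..i only
    scores = q_i @ K_vis.T / np.sqrt(d_k)    # scaled dot-product scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_vis                   # weighted sum of visible values

# Toy example: 5 tokens, head dimension 4 (shapes are illustrative)
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out_3 = causal_head_output(Q[3], K, V, i=3)  # token 4 is never consulted
```

Because the slice stops at row i, perturbing the key or value of any later token leaves the output for position i unchanged, which is exactly the autoregressive property described above.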

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Multi-Head Attention Output Calculation
In a multi-head attention mechanism, each individual attention head computes its output using its own unique Query, Key, and Value matrices, which are distinct linear projections of the same input. What is the primary functional consequence of this design choice?
Debugging an Attention Head
Dimensionality of an Attention Head Output
You are examining the computation for a single attention head within a multi-head attention layer. Arrange the following steps in the correct chronological order to produce the output for this individual head.
Autoregressive Individual Attention Head Computation
Learn After
An autoregressive language model is in the process of generating a sequence of tokens. When a single attention head calculates its output for the 4th token in the sequence, which set of key and value vectors does it use to ensure it only relies on previously generated information?
True or False: In a causal attention mechanism, when a single attention head is calculating the output for the 4th token in a sequence, the query vector for that 4th token (q_4) will interact with the key vector from the 6th token (k_6) to compute an attention score.
Causal Attention Inputs