Learn Before
Reusability of Key-Value Pairs in Autoregressive Inference
During autoregressive inference, once the key and value vectors for a specific token are computed, they remain constant and are reused in all subsequent generation steps. For example, when generating the i-th token, the model attends to the key-value pairs of all preceding tokens (0 to i-1). These same pairs will be needed again when generating the (i+1)-th token, along with the newly generated pair for token i. This repeated usage makes re-computation inefficient and provides the primary motivation for the KV cache.
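The reuse described above can be sketched in a few lines. This is a minimal single-head illustration, not a production implementation: the projection matrices `Wq`, `Wk`, `Wv` and the helper names are assumptions for the example, and each step appends exactly one new key-value pair to the cache instead of recomputing earlier ones.

```python
import numpy as np

def attention_step(q, K, V):
    # Scaled dot-product attention for one query against all cached pairs.
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # similarity with each cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # weighted sum of cached values

def generate_with_kv_cache(token_embeddings, Wq, Wk, Wv):
    # The cache grows by one key/value pair per step; pairs computed at
    # earlier steps are reused verbatim, never recomputed.
    K_cache, V_cache, outputs = [], [], []
    for x in token_embeddings:
        K_cache.append(x @ Wk)            # computed once for this token
        V_cache.append(x @ Wv)            # ...and reused in all later steps
        q = x @ Wq                        # only the query is per-step work
        outputs.append(attention_step(q, np.array(K_cache), np.array(V_cache)))
    return np.array(outputs)
```

Note that per step the only new attention work is one query projection, one key-value projection for the current token, and one dot-product row against the cache; this is precisely the saving the KV cache provides.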
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational Cost per Token in Causal Attention
Reusability of Key-Value Pairs in Autoregressive Inference
Example of Query-Key Interactions in Causal Attention
An autoregressive model is generating a sequence of tokens one by one. It is currently calculating the attention output for the token at position 4 (i.e., the fifth token in the sequence). To ensure the model only uses information it has already seen, which set of key (K) and value (V) vectors must be used as input to the attention mechanism for the query vector at position 4 (q₄)?
Diagnosing Information Leakage in an Autoregressive Model
When calculating the attention output for a specific token at position i in an autoregressive model, the mechanism is structured to use the query vector from that same position (q_i), while the key and value matrices are composed of the corresponding vectors from all positions in the full input sequence.
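The correct behavior, in contrast to the leakage scenario above, restricts each query to keys and values at its own position or earlier. A minimal sketch of this causal masking (single head, no batching; the function name is an assumption for illustration):

```python
import numpy as np

def causal_attention(Q, K, V):
    # Full-sequence attention with a causal mask: the query at position i
    # may only attend to positions 0..i, so no future information leaks in.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
    scores[future] = -np.inf                            # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

A quick way to diagnose leakage with this setup: perturb the key and value vectors of a late position and confirm that the outputs at earlier positions are unchanged; if they shift, future information is leaking backward.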
Learn After
Key-Value (KV) Cache in Transformer Inference
Computational Efficiency in Autoregressive Generation
An autoregressive model is generating a sequence of text. To produce the 5th token, it computes attention using a query from position 5 and the key/value pairs from positions 1-4. When the model then proceeds to generate the 6th token, which statement accurately describes the most computationally efficient approach for handling the key and value pairs from the first four tokens (positions 1-4)?
During an autoregressive text generation process, to produce the 10th token in a sequence, the model must re-calculate the key and value vectors for all nine preceding tokens to ensure the contextual information is current.