Key-Value Cache
In autoregressive transformer models, a Key-Value (KV) cache is a mechanism that optimizes generation during inference. It stores the key and value matrices computed at all previous steps so they do not need to be recomputed for each newly generated token. Because the model performs attention over a group of feature sub-spaces in parallel, the KV cache must retain the keys and values for every attention head. The KV cache at generation step $i$ can be represented as the set of historical keys and values across these heads: $\{(\mathbf{K}^{(h)}_{\le i}, \mathbf{V}^{(h)}_{\le i})\}_{h=1}^{H}$, where $H$ is the number of attention heads.
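The caching idea can be sketched in a few lines of plain Python. This is a minimal illustration for a single attention head, not any library's API; the names (`KVCache`, `append_step`, `attend`) and the 2-d toy vectors are assumptions for the example.

```python
import math

class KVCache:
    def __init__(self):
        self.keys = []    # one key vector per past token
        self.values = []  # one value vector per past token

    def append_step(self, k, v):
        # At each generation step, store only the new token's key and
        # value; the history is reused rather than recomputed.
        self.keys.append(k)
        self.values.append(v)

def attend(q, cache):
    # Scaled dot-product attention of the current query against all
    # cached keys, returning a weighted average of cached values.
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in cache.keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(cache.values[0])
    return [sum(w * v[j] for w, v in zip(weights, cache.values))
            for j in range(dim)]

cache = KVCache()
cache.append_step([1.0, 0.0], [2.0, 0.0])  # token 1
cache.append_step([0.0, 1.0], [0.0, 2.0])  # token 2
out = attend([1.0, 0.0], cache)  # query aligns with token 1's key
```

In a multi-head model, one such cache would be kept per head per layer, matching the set-of-heads notation above.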
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Query (Attention)
Key (Attention)
Value (Attention)
State Function from Previous Outputs
Value Weight Matrix Formula
Set of Sequential Key-Value Pairs
Query Vector
Key Vector
Value Vector
Implicit Relative Position Modeling in Self-Attention with RoPE
Value Weight Matrix Definition ()
Imagine a system translating the sentence 'The quick brown fox jumps'. When the system is generating the output word corresponding to 'jumps', it needs to determine which words in the input sentence are most relevant. To do this, a vector representing the current translation context (i.e., 'what information do I need to produce the next word?') is compared against a set of searchable 'label' vectors, one for each word in the input sentence. This comparison generates a relevance score for each input word. Finally, a new vector is created by taking a weighted average of the 'content' vectors of the input words, using the relevance scores as weights. How do the three main vector types in this process correspond to their roles?
In a system designed to answer questions based on a provided document, the model first creates a representation of the user's question. It then compares this representation against a set of searchable representations, one for each sentence in the document, to determine relevance scores. Finally, it constructs an answer by creating a weighted combination of the informational content from each sentence, using the relevance scores as weights. Which option correctly assigns the roles of Query, Key, and Value vectors in this scenario?
Context Window of Key Vectors Notation
Key-Value Cache
In a computational mechanism designed to selectively focus on different parts of an input sequence, information is represented by three distinct types of vectors that interact to produce a context-aware output. Match each vector type to its specific role in this process.
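The query/key/value roles in the questions above can be made concrete with a toy example in the spirit of the translation scenario. The word vectors and the simple score normalization here are illustrative assumptions, not values from any trained model.

```python
# Keys act as searchable "labels", values carry the "content", and the
# query expresses what the current step needs.
words = ["quick", "fox", "jumps"]
keys = {"quick": [0.5, 0.5], "fox": [1.0, 0.0], "jumps": [0.0, 1.0]}
values = {"quick": [0.0, 1.0], "fox": [1.0, 2.0], "jumps": [3.0, 0.0]}

query = [0.0, 1.0]  # "what information do I need for the next word?"

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Relevance score of each input word = query . key.
scores = {w: dot(query, keys[w]) for w in words}
total = sum(scores.values())
weights = {w: scores[w] / total for w in words}

# Output = weighted average of the content (value) vectors.
output = [sum(weights[w] * values[w][j] for w in words) for j in range(2)]
```

Here the query matches "jumps"'s key most strongly, so the output is dominated by "jumps"'s value vector, exactly the weighted-average behavior the questions describe.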
Masked QKV Attention Formula
Set of Sequential Key-Value Pairs
Let a sequence of vectors be constructed where the first element is and the second element is . The third element has multiple potential versions, and the 5th version is given as . According to the notational definition , what is the specific sequence represented by when using the 5th version of the 3rd element?
Key Matrix for Causal Attention (K_≤i)
Deconstructing Vector Prefix Notation
Key-Value Cache
Consider a sequence of vectors represented as $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n$. The notation $\mathbf{v}_{\le 2}$ represents the subsequence containing only the first two vectors, $(\mathbf{v}_1, \mathbf{v}_2)$.
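The "first $i$ elements" prefix notation (as in $K_{\le i}$) maps directly onto sequence slicing. A tiny sketch, with illustrative one-dimensional vectors:

```python
# A sequence of vectors v_1 .. v_4.
v = [[1.0], [2.0], [3.0], [4.0]]

# v_{<=2}: the subsequence of the first two vectors, (v_1, v_2).
v_le_2 = v[:2]
```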
Learn After
An autoregressive model is generating a sequence of text. It has already produced 100 tokens and is now in the process of generating the 101st token. The model uses an optimization where the key and value vectors for all 100 previous tokens are stored and reused. What is the primary computational advantage of this storage mechanism when calculating the attention for the 101st token?
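A back-of-the-envelope count makes the advantage in this question concrete. Counting key/value projections computed per generation run (a simplification that ignores the attention score computation itself):

```python
def projections_without_cache(n_tokens):
    # Without a cache, step t recomputes key/value projections for all
    # t tokens seen so far: 1 + 2 + ... + n, i.e. quadratic growth.
    return sum(t for t in range(1, n_tokens + 1))

def projections_with_cache(n_tokens):
    # With a cache, step t computes projections only for the newest
    # token, so the total grows linearly.
    return n_tokens

no_cache = projections_without_cache(101)
with_cache = projections_with_cache(101)
```

For the 101-token scenario above, the cached run performs 101 projections instead of 5151; the keys and values for the 100 stored tokens are simply read back when computing attention for token 101.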
Inference Performance Analysis
State of the Key-Value Cache During Generation