Learn Before
Formula for Updating the Key Matrix in the KV Cache
During autoregressive inference, the Key matrix (K) in the KV cache is expanded at each step. The new key vector, k', corresponding to the current token, is appended to the existing matrix of keys. This update operation is expressed by the formula:

K = Append(K, k')
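The append step can be sketched in plain Python. This is a minimal illustration, assuming the cached keys are stored as rows of a nested list; the names `append_key`, `K`, and `k_new`, and the example values, are hypothetical and not taken from the course material.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def append_key(K: Matrix, k_new: Vector) -> Matrix:
    """Return a new key matrix with k_new added as the last row."""
    # Every cached key must share the same dimension as the new key.
    if K and len(K[0]) != len(k_new):
        raise ValueError("k_new must have the same dimension as the cached keys")
    # Existing rows are kept unchanged; the cache grows by exactly one row.
    return K + [list(k_new)]


# Hypothetical cache holding keys for three earlier tokens (dimension 2).
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
k_new = [0.2, 0.7]  # key vector computed for the current token

K = append_key(K, k_new)
print(len(K))  # 4 rows: one per token processed so far
print(K[-1])   # [0.2, 0.7]
```

In this sketch the matrix gains one row per generated token and no earlier row is overwritten. Practical implementations often preallocate a tensor and write into the next free slot instead of copying, but the logical effect is the same append.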

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Single-Step Generation with a KV Cache
Formula for Updating the Key Matrix in the KV Cache
Formula for Updating the Value Matrix in the KV Cache
Example of a Single-Step KV Cache Update
During autoregressive text generation, a model has already processed N tokens and stored their corresponding key and value vectors in a cache. When the model processes the (N+1)-th token, how is this cache utilized and modified to compute the output for this new step?
An autoregressive model is generating a sequence and has just processed the token at position t. The Key-Value cache currently stores the key and value vectors for all tokens from position 1 to t. As the model processes the next token at position t+1, which statement correctly describes how the cache is updated and used for the attention calculation at this new step?
Notation for Current Query, Key, and Value Vectors (q', k', v')
Diagram of a Single-Step KV Cache Update and Attention
Debugging a Flawed KV Cache Implementation
Learn After
In a transformer model generating text, a matrix of 'key' vectors is maintained for all previously generated tokens. Suppose at a certain step, this matrix K contains vectors for two previous tokens and is represented as: K = [[0.1, 0.5], [0.9, 0.2]]. The model then processes a new token and generates a corresponding new key vector k_new: k_new = [0.4, 0.8]. Based on the standard procedure for expanding this matrix during text generation, what will the updated matrix K be after incorporating k_new?
A large language model is generating the next token in a sequence. Arrange the following steps in the correct chronological order as they relate to updating the matrix of 'key' vectors for the attention mechanism.
In an autoregressive language model generating a sequence of text, the matrix containing 'key' vectors for previously generated tokens is updated at each step. Consider a scenario where this matrix has been populated with vectors from the first 10 tokens. When the 11th token is processed and its corresponding key vector is generated, the update procedure involves replacing the key vector of the very first token with the new one to keep the matrix size constant.