Learn Before
Formula for Updating the Value Matrix in the KV Cache
During autoregressive inference, the Value matrix (V) in the KV cache is extended at each step. The new value vector, v', corresponding to the current token, is appended to the existing matrix of values. This update operation is expressed by the formula:
V = Append(V, v')
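The append operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference implementation: it assumes the cache is stored as a list of rows with one row per token, and the names `append_value` and `v_new` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def append_value(V: Matrix, v_new: Vector) -> Matrix:
    """Return a new value cache with v_new added as the last row.

    Assumption: each row of V is the value vector of one already-processed token.
    """
    if V and len(v_new) != len(V[0]):
        raise ValueError("v_new must have the same dimension as the cached rows")
    # Existing rows are kept unchanged; the cache grows by exactly one row.
    return V + [list(v_new)]


# Cache after two tokens (illustrative numbers only).
V = [[1.0, 2.0], [3.0, 4.0]]
v_new = [5.0, 6.0]  # value vector of the current token

V = append_value(V, v_new)
print(V)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

No earlier row is modified or discarded, so the cache always holds one value vector per token processed so far. Tensor-based implementations typically express the same step as a concatenation along the sequence dimension.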
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Single-Step Generation with a KV Cache
Formula for Updating the Key Matrix in the KV Cache
Formula for Updating the Value Matrix in the KV Cache
Example of a Single-Step KV Cache Update
During autoregressive text generation, a model has already processed N tokens and stored their corresponding key and value vectors in a cache. When the model processes the (N+1)-th token, how is this cache utilized and modified to compute the output for this new step?
An autoregressive model is generating a sequence and has just processed the token at position t. The Key-Value cache currently stores the key and value vectors for all tokens from position 1 to t. As the model processes the next token at position t+1, which statement correctly describes how the cache is updated and used for the attention calculation at this new step?
Notation for Current Query, Key, and Value Vectors (q', k', v')
Diagram of a Single-Step KV Cache Update and Attention
Debugging a Flawed KV Cache Implementation
Learn After
An autoregressive language model is generating a sequence of tokens. The attention mechanism has already processed the first three tokens, resulting in the following Value matrix V stored in its cache, where each row corresponds to a token: V = [[0.1, 0.8], [0.5, 0.2], [0.9, 0.3]]. For the fourth token, the model computes a new value vector: v_new = [0.4, 0.6]. According to the standard update rule for the cache, what will the new Value matrix V be after this fourth token is processed?
Analysis of KV Cache Update Methods
During autoregressive inference, the Value matrix in the cache is updated by replacing the oldest value vector with the newly computed value vector for the current token.