Updating the KV Cache
The procedure for updating the Key-Value (KV) cache at a given position is an essential operation during autoregressive sequence generation. Specifically, at a new position i, the newly generated key vector (k_i) and value vector (v_i) are appended to their respective cache matrices, K and V. Using a function Append(M, m) that adds a row vector m to a matrix M, the update rule is defined as K ← Append(K, k_i) and V ← Append(V, v_i). This mechanism maintains a history of key-value pairs, enabling a Transformer decoder to attend to past context efficiently.
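The append-style update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the helper name append_row, the head dimension d, and the random stand-in vectors are all assumptions made for the example.

```python
import numpy as np

def append_row(M, m):
    # Append a length-d row vector m to an (n x d) matrix M, giving (n+1) x d.
    # This plays the role of the Append(M, m) function in the text.
    return np.vstack([M, m[None, :]])

d = 4                  # head dimension (illustrative)
K = np.zeros((0, d))   # empty key cache before generation starts
V = np.zeros((0, d))   # empty value cache

# Simulate three decoding steps: each step produces one new key/value pair,
# which is appended to the cache rather than recomputing past positions.
for step in range(3):
    k_new = np.random.randn(d)
    v_new = np.random.randn(d)
    K = append_row(K, k_new)  # K <- Append(K, k_i)
    V = append_row(V, v_new)  # V <- Append(V, v_i)

print(K.shape, V.shape)  # (3, 4) (3, 4)
```

After N steps the caches hold N rows each, so attention at step N+1 can read all past keys and values directly from K and V.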
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Single-Step Generation with a KV Cache
Updating the KV Cache
In a self-attention layer processing an input sequence of two tokens, let the input vector for the first token be x_1 and for the second token be x_2. The layer generates a query vector q_1 (for the first token) and a key vector k_2 (for the second token). Which statement accurately describes the relationship between these inputs and generated vectors?
Correcting a Misconception in Vector Generation
Calculating a Query Vector in Self-Attention
In a standard self-attention mechanism, an input vector is transformed into three separate vectors (Query, Key, and Value) using three distinct, learned weight matrices. Imagine a modified self-attention layer where these three weight matrices are constrained to be identical. What would be the most direct consequence of this change?
Space Complexity of the KV Cache
Updating the KV Cache
Two-Phase Inference from a KV Cache Perspective
Single-Step Generation with a KV Cache
Memory Allocation for KV Caching in Standard Self-Attention
Multi-Dimensional Structure of the KV Cache
An autoregressive language model generates text one word at a time. To generate the 100th word, it must relate it to all 99 previous words. A common optimization involves storing in memory the intermediate representations for each of the first 99 words as they are generated.
Which statement best analyzes the primary computational advantage of this optimization compared to re-computing everything from scratch at step 100?
Chatbot Performance Degradation
Computational Steps in Cached Inference
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
You run an internal LLM inference service for empl...
Your company’s internal LLM service handles many c...
You operate a GPU-backed LLM service that uses con...
You’re on-call for an internal LLM chat service. M...
Learn After
Single-Step Generation with a KV Cache
Formula for Updating the Key Matrix in the KV Cache
Formula for Updating the Value Matrix in the KV Cache
Example of a Single-Step KV Cache Update
During autoregressive text generation, a model has already processed N tokens and stored their corresponding key and value vectors in a cache. When the model processes the (N+1)-th token, how is this cache utilized and modified to compute the output for this new step?
An autoregressive model is generating a sequence and has just processed the token at position t. The Key-Value cache currently stores the key and value vectors for all tokens from position 1 to t. As the model processes the next token at position t+1, which statement correctly describes how the cache is updated and used for the attention calculation at this new step?
Notation for Current Query, Key, and Value Vectors (q', k', v')
Diagram of a Single-Step KV Cache Update and Attention
Debugging a Flawed KV Cache Implementation