Single-Step Generation with a KV Cache
During each step i of autoregressive generation, the model computes a new query (q'), key (k'), and value (v') vector from the current input token. The new key-value pair (k', v') is appended to the KV cache, which holds the pairs for all preceding tokens. The attention operation is then performed using the new query and the complete set of keys and values stored in the cache up to the current step, denoted as K and V. This produces the output for step i by letting the current token attend to itself and all previous tokens in the sequence.
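The update-then-attend step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's implementation; the function and variable names (decode_step, K_cache, V_cache) are mine, and it assumes single-head attention with 1-D q', k', v' vectors.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, K_cache, V_cache):
    """One autoregressive decoding step with a KV cache (illustrative sketch)."""
    # Append the new key-value pair (k', v') to the cached K and V matrices.
    K = np.vstack([K_cache, k_new])      # shape: (t, d)
    V = np.vstack([V_cache, v_new])      # shape: (t, d)
    # Attend with the single new query q' over all cached keys and values.
    d = q_new.shape[-1]
    scores = K @ q_new / np.sqrt(d)      # one score per cached position
    scores -= scores.max()               # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum()             # softmax over all t positions
    out = weights @ V                    # attention output for step i
    return out, K, V

# Usage: a cache holding 3 preceding tokens, plus the current token's q', k', v'.
rng = np.random.default_rng(0)
d = 4
K_cache = rng.standard_normal((3, d))
V_cache = rng.standard_normal((3, d))
q, k, v = rng.standard_normal((3, d))
out, K_cache, V_cache = decode_step(q, k, v, K_cache, V_cache)
```

Note that only one query participates in the attention at each step: the keys and values of earlier tokens are read from the cache rather than recomputed, which is the entire point of the optimization.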

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Updating the KV Cache
In a self-attention layer processing an input sequence of two tokens, let the input vector for the first token be x_1 and for the second token be x_2. The layer generates a query vector q_1 (for the first token) and a key vector k_2 (for the second token). Which statement accurately describes the relationship between these inputs and generated vectors?
Correcting a Misconception in Vector Generation
Calculating a Query Vector in Self-Attention
In a standard self-attention mechanism, an input vector is transformed into three separate vectors (Query, Key, and Value) using three distinct, learned weight matrices. Imagine a modified self-attention layer where these three weight matrices are constrained to be identical. What would be the most direct consequence of this change?
Formula for Updating the Key Matrix in the KV Cache
Formula for Updating the Value Matrix in the KV Cache
Example of a Single-Step KV Cache Update
During autoregressive text generation, a model has already processed N tokens and stored their corresponding key and value vectors in a cache. When the model processes the (N+1)-th token, how is this cache utilized and modified to compute the output for this new step?
An autoregressive model is generating a sequence and has just processed the token at position t. The Key-Value cache currently stores the key and value vectors for all tokens from position 1 to t. As the model processes the next token at position t+1, which statement correctly describes how the cache is updated and used for the attention calculation at this new step?
Notation for Current Query, Key, and Value Vectors (q', k', v')
Diagram of a Single-Step KV Cache Update and Attention
Debugging a Flawed KV Cache Implementation
Diagram of the Decoding Phase
Comparison of Prefilling and Decoding Phases
Disaggregation of Prefilling and Decoding using Pipelined Engines
After a large language model processes an initial prompt, it enters a generation stage where it produces the output sequence one token at a time. In each step of this stage, a new query vector is generated for the current position, and it must perform an attention operation over the key-value pairs of the initial prompt plus all the key-value pairs of the tokens generated in previous steps. As the output sequence gets longer, what becomes the most significant performance bottleneck for generating each new token?
A large language model has finished processing an initial prompt and is about to generate the first token of its response. Arrange the following events in the correct chronological order for this single generation step.
Evaluating an Inference Optimization Proposal
You run an internal LLM inference service for empl...
You’re on-call for an internal LLM chat service. M...
You operate a GPU-backed LLM service that uses con...
Your company’s internal LLM service handles many c...
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Decoding Phase Goal Formula
Space Complexity of the KV Cache
Updating the KV Cache
Two-Phase Inference from a KV Cache Perspective
Memory Allocation for KV Caching in Standard Self-Attention
Multi-Dimensional Structure of the KV Cache
An autoregressive language model generates text one word at a time. To generate the 100th word, it must relate it to all 99 previous words. A common optimization involves storing in memory the intermediate representations for each of the first 99 words as they are generated.
Which statement best analyzes the primary computational advantage of this optimization compared to re-computing everything from scratch at step 100?
Chatbot Performance Degradation
Computational Steps in Cached Inference
Learn After
Next Token Prediction Formula
An autoregressive model is generating the 11th token of a sequence. The Key-Value (KV) Cache has already been populated with the key and value vectors for the first 10 tokens. For this 11th generation step, a new query (q_11), key (k_11), and value (v_11) vector are computed. Which of the following accurately describes the set of key vectors that the new query (q_11) will perform its attention operation over to produce the output for this step?
You are observing a single step of autoregressive generation in a transformer model, specifically for the token at position i. Arrange the following computational events in the correct chronological order for this single step.
Formula for Cache State Evolution during Autoregressive Decoding
Analyzing a Flawed KV Cache Implementation