Learn Before
Analyzing a Flawed KV Cache Implementation
A developer is implementing an autoregressive generation process. They propose an 'optimized' single-step procedure for generating the token at position i. In their design, the model first computes the new query q_i. It then performs the attention operation using q_i and the set of keys and values already stored in the cache from all previous steps (1 to i-1). Only after this attention calculation is complete are the newly computed key k_i and value v_i appended to the cache for use in the next step (i+1). Analyze this proposed procedure. What is the fundamental flaw in this logic, and what would be the likely consequence for the model's generated output?
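The ordering bug described above can be made concrete with a small sketch. This is a minimal, hypothetical illustration (plain NumPy, single head, single query vector — not the developer's actual code): the flawed step runs attention before appending `k_i`/`v_i`, so the query for position i only sees positions 1..i-1 and never attends to its own token; the correct step appends first so attention covers positions 1..i.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector q
    # over stacked keys K (n, d) and values V (n, d).
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def flawed_step(q_i, k_i, v_i, cache_K, cache_V):
    # FLAW: attention runs over keys 1..i-1 only; q_i never
    # attends to its own token's key/value.
    out = attention(q_i, np.vstack(cache_K), np.vstack(cache_V))
    cache_K.append(k_i)   # appended too late to affect this step
    cache_V.append(v_i)
    return out

def correct_step(q_i, k_i, v_i, cache_K, cache_V):
    # Append k_i, v_i first, so attention covers positions 1..i
    # (a token always attends to itself under the causal mask).
    cache_K.append(k_i)
    cache_V.append(v_i)
    return attention(q_i, np.vstack(cache_K), np.vstack(cache_V))
```

With a non-empty cache, the two procedures produce different outputs for the same inputs, which is the observable symptom: every generated token is computed as if its own key/value did not exist, degrading output quality at every step.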
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Next Token Prediction Formula
An autoregressive model is generating the 11th token of a sequence. The Key-Value (KV) Cache has already been populated with the key and value vectors for the first 10 tokens. For this 11th generation step, a new query (q_11), key (k_11), and value (v_11) vector are computed. Which of the following accurately describes the set of key vectors that the new query (q_11) will perform its attention operation over to produce the output for this step?
You are observing a single step of autoregressive generation in a transformer model, specifically for the token at position i. Arrange the following computational events in the correct chronological order for this single step.
Formula for Cache State Evolution during Autoregressive Decoding