Learn Before
Reducing KV Cache Complexity via Windowed Caching
The space complexity of the standard Key-Value (KV) cache grows linearly with the number of tokens, i.e., O(n) for a sequence of n tokens, and can be reduced by caching fewer tokens. For instance, sliding window attention uses a fixed-size window to store keys and values only for the local context. This restricts the caching mechanism's space complexity to a constant O(w), where w is the window size, making memory usage manageable regardless of the overall sequence length.
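The eviction behavior described above can be sketched in a few lines of Python. This is a toy illustration, not any framework's actual API: the class name `WindowedKVCache` and the use of `collections.deque` with `maxlen` are assumptions made for clarity. Once the window fills, appending a new token's key/value pair silently drops the oldest one, so the cache size stays bounded at w no matter how long generation runs.

```python
from collections import deque

class WindowedKVCache:
    """Toy KV cache keeping keys/values only for the last `window` tokens."""

    def __init__(self, window: int):
        # deque with maxlen evicts the oldest entry automatically on append
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

# Generate 5000 tokens with a 2048-token window:
cache = WindowedKVCache(window=2048)
for t in range(5000):
    cache.append(f"k{t}", f"v{t}")

print(len(cache))  # -> 2048: memory stays constant past the window size
```

In a real transformer the cached entries would be per-layer, per-head key and value tensors rather than strings, but the memory argument is identical: storage is O(w), not O(n).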
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reducing KV Cache Complexity via Windowed Caching
An engineer is deploying a large autoregressive model for a chatbot. They observe that as a conversation with a user gets longer, the model's memory consumption increases steadily, eventually leading to performance issues. This is because the model stores key and value vectors for every token in the conversation history to speed up the generation of the next token. Based on this mechanism, what is the fundamental relationship between the length of the conversation history (in tokens) and the amount of memory required for this storage?
KV Cache Memory Footprint Comparison
Calculating Memory Growth for Token Caching
Reducing KV Cache Complexity via Head Sharing
Formula for KV Cache Memory Size
Learn After
Space Complexity of Sliding Window Attention
Optimizing Memory for Long-Document Processing
An auto-regressive language model is generating a long text, one token at a time. To manage memory, it employs a key-value caching strategy where it only stores the keys and values for the most recent 2048 tokens. How will the memory allocated for this cache change as the model generates the 5000th token and continues beyond it?
Comparing KV Cache Memory Growth