Learn Before
An engineer is deploying a large autoregressive model for a chatbot. They observe that as a conversation with a user gets longer, the model's memory consumption increases steadily, eventually leading to performance issues. This is because the model stores key and value vectors for every token in the conversation history to speed up the generation of the next token. Based on this mechanism, what is the fundamental relationship between the length of the conversation history (in tokens) and the amount of memory required for this storage?
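Because one key vector and one value vector are cached per token, per attention head, per layer, the memory footprint grows linearly with conversation length. A minimal sketch of this relationship, using hypothetical model dimensions (32 layers, 32 heads, head dimension 128, fp16 elements) chosen only for illustration:

```python
def kv_cache_bytes(seq_len, num_layers=32, num_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a given sequence length.

    The factor of 2 accounts for storing both a key vector and a
    value vector for every token, in every head of every layer.
    All model dimensions here are illustrative assumptions.
    """
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem * seq_len

# Doubling the conversation length doubles the cache: growth is O(n).
for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} tokens -> {kv_cache_bytes(n) / 2**30:.2f} GiB")
```

Under these assumptions each token adds a fixed 512 KiB to the cache, so doubling the history length exactly doubles the memory required.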
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Reducing KV Cache Complexity via Windowed Caching
KV Cache Memory Footprint Comparison
Calculating Memory Growth for Token Caching
Reducing KV Cache Complexity via Head Sharing
Formula for KV Cache Memory Size