Learn Before
KV Cache Memory Calculation
An autoregressive language model uses a key-value (KV) cache to store the key and value vectors for each token in the context window, across all attention heads and layers, so that subsequent tokens can be generated without recomputing them. Given the following parameters for a specific model, calculate the total number of individual floating-point values that would be stored in this cache when it is completely full.
- Number of layers: 32
- Number of attention heads per layer: 12
- Dimensionality of each head's key/value vector: 64
- Context window size: 2048 tokens
Provide only the final numerical answer.
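The arithmetic can be sketched in a few lines: the cache holds one key vector and one value vector (a factor of 2) per token, per head, per layer. The function name below is illustrative, not from the card.

```python
def kv_cache_values(layers: int, heads: int, head_dim: int, context: int) -> int:
    """Total floating-point values in a full KV cache.

    Factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * heads * head_dim * context

# Parameters from the question above.
print(kv_cache_values(layers=32, heads=12, head_dim=64, context=2048))
# → 100663296
```

Note that this counts values, not bytes; the memory footprint also depends on the numeric precision (e.g. 2 bytes per value in fp16), and doubling the context window doubles the count, as the related question below explores.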
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive language model uses a key-value cache to store contextual information during text generation. A developer decides to double the maximum sequence length that the model can process. Assuming all other architectural parameters (such as the number of layers, number of attention heads, and the dimensionality of each head) remain constant, by what factor will the maximum memory required for the key-value cache change?
Optimizing KV Cache for a Chatbot Application
KV Cache Memory Calculation