Learn Before
KV Cache Structure Trade-offs
Analyze the two proposed modifications below for reducing the memory footprint of a model's Key-Value (KV) cache during text generation. For each option, describe how it alters the cache's multi-dimensional structure and discuss the likely trade-off for the model's ability to understand context.
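Before analyzing the modifications, it helps to have the cache's size formula in hand. The sketch below computes the KV cache footprint per generated token under a standard multi-head attention layout; the parameter names (`n_layers`, `n_heads`, `head_dim`) and the example values are illustrative assumptions, not values given by this card.

```python
# Hypothetical sketch of KV cache size per token for standard
# multi-head attention. Each layer stores one key vector and one
# value vector per head, hence the leading factor of 2.

def kv_cache_bytes_per_token(n_layers: int, n_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per generated token.

    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem

# Illustrative configuration (not from the card above):
base = kv_cache_bytes_per_token(n_layers=32, n_heads=32, head_dim=128)

# Doubling the head count while halving each head's dimensionality
# leaves the per-token footprint unchanged, since only the product
# n_heads * head_dim enters the formula.
modified = kv_cache_bytes_per_token(n_layers=32, n_heads=64, head_dim=64)
assert base == modified
```

Any modification that actually shrinks the cache must therefore reduce one of the factors outright, for example caching fewer layers, sharing K/V across heads, or truncating the cached sequence, each of which trades away some capacity to attend over the full context.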
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer modifies a large language model by doubling the number of attention heads per layer while simultaneously halving the dimensionality of each head's key/value vectors. Assuming all other parameters (like the number of layers and sequence length) remain constant, how does this architectural change affect the multi-dimensional structure of the model's key-value (KV) cache?
KV Cache Structure Trade-offs
Calculating KV Cache Size per Token