Learn Before
Layer-wise Structure of the KV Cache
The overall Key-Value (KV) cache generated by a Transformer's decoding network is a composite structure containing the individual KV caches from each of its layers. For a model with L layers, the complete cache is represented as a collection of these layer-specific caches: Cache = {Cache_1, Cache_2, ..., Cache_L}, where Cache_l is the KV cache from the l-th layer.
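The layer-wise structure described above can be sketched in code. The following is a minimal illustration (not from the course text): the complete cache is simply a list holding one (K, V) pair per layer, and each layer's cache grows by one entry per processed token. The names num_layers, d_head, and append_token are illustrative assumptions, and the tensor shapes are a simplified convention.

```python
# Illustrative sketch: a KV cache organized layer by layer.
# num_layers, d_head, append_token are hypothetical names, not from the text.
import numpy as np

num_layers = 12   # L: number of decoder layers
num_heads = 4
d_head = 8

# The complete cache is a collection with one (K, V) pair per layer.
# Each K and V has shape (num_heads, seq_len, d_head), growing as tokens arrive.
kv_cache = [
    (np.zeros((num_heads, 0, d_head)), np.zeros((num_heads, 0, d_head)))
    for _ in range(num_layers)
]

def append_token(cache, layer, k_new, v_new):
    """Append one token's keys/values to the given layer's cache."""
    k, v = cache[layer]
    cache[layer] = (
        np.concatenate([k, k_new[:, None, :]], axis=1),
        np.concatenate([v, v_new[:, None, :]], axis=1),
    )

# Simulate prefilling a 5-token prompt: every layer stores one K/V per token.
for _ in range(5):
    for layer in range(num_layers):
        append_token(kv_cache, layer,
                     np.random.randn(num_heads, d_head),
                     np.random.randn(num_heads, d_head))

print(len(kv_cache))         # 12 layer-specific caches (Cache_1 ... Cache_L)
print(kv_cache[3][0].shape)  # (4, 5, 8): layer 4's keys after a 5-token prompt
```

Each layer keeps its own cache because each layer computes its own keys and values from its own hidden states; no single shared tensor could serve all layers.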

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Layer-wise Structure of the KV Cache
A large language model processes an input prompt, denoted as x, using a function Dec_kv(x) as part of its inference process. This function utilizes the model's standard decoding network but is configured for a specific preparatory task. Based on this context, what is the primary output of the Dec_kv(x) function?

In the context of prefilling a Key-Value cache for an input prompt, the function Dec_kv(·) represents a neural network with a fundamentally different architecture than the standard decoding network, Dec(·), as it is specialized solely for computing key-value pairs.

Relationship Between Decoding Networks for Inference
Learn After
KV Cache Memory Scaling
A developer is examining the internal state of a 12-layer Transformer decoder after it has processed an input prompt. They notice that the generated Key-Value (KV) cache is not a single, large data structure, but is instead organized as a collection of 12 separate caches. What is the fundamental reason for this layer-wise organization?
Accessing a Specific Layer's KV Cache