Decoding Phase as a Memory-Bound Process
The decoding phase in Transformer models is memory-bound because generating each new token requires reading the entire Key-Value (KV) cache from memory while performing relatively little computation per token. This bottleneck worsens as the output sequence grows: the cache gains one entry per generated token, so the memory traffic per decoding step increases linearly with the length of the sequence produced so far.
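To make the linear growth concrete, here is a minimal, illustrative sketch of cached single-head attention during decoding. All names (`decode_step`, the head dimension `D`) and the random toy vectors are assumptions for illustration, not the model's actual implementation; the point is that each step appends one (key, value) pair and then reads every cached pair.

```python
import math
import random

random.seed(0)
D = 8  # head dimension (illustrative size, not a real model's)

# KV cache: one (key, value) vector pair per token generated so far.
k_cache: list[list[float]] = []
v_cache: list[list[float]] = []

def rand_vec() -> list[float]:
    return [random.gauss(0.0, 1.0) for _ in range(D)]

def decode_step(q, k_new, v_new):
    """One decoding step: append the new token's K/V, then attend over
    the ENTIRE cache. Floats read from the cache grow linearly with the
    number of tokens decoded, while compute per step stays small --
    the memory-bound pattern described above."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    # Attention scores: one dot product against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in k_cache]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Output: weighted sum over every cached value vector.
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(D)]

out = None
for t in range(1, 6):
    out = decode_step(rand_vec(), rand_vec(), rand_vec())
    print(f"step {t}: KV cache holds {len(k_cache)} token(s), "
          f"{2 * len(k_cache) * D} cached floats read")
```

Each step does only O(t·D) arithmetic but must stream the full cache (2·t·D values) from memory, which is why faster arithmetic units alone do not speed up decoding.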
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences