Learn Before
An autoregressive model is generating a sequence of text. It has already produced 100 tokens and is now in the process of generating the 101st token. The model uses an optimization where the key and value vectors for all 100 previous tokens are stored and reused. What is the primary computational advantage of this storage mechanism when calculating the attention for the 101st token?
0
1
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive model is generating a sequence of text. It has already produced 100 tokens and is now in the process of generating the 101st token. The model uses an optimization where the key and value vectors for all 100 previous tokens are stored and reused. What is the primary computational advantage of this storage mechanism when calculating the attention for the 101st token?
Inference Performance Analysis
State of the Key-Value Cache During Generation