1Cademy - An autoregressive model is generating a sequence of text. It has already produced 100 tokens and is now in the process of generating the 101st token. The model uses an optimization where the key and value vectors for all 100 previous tokens are stored and reused. What is the primary computational advantage of this storage mechanism when calculating the attention for the 101st token?
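
The scenario in the question can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any particular model's implementation: it uses a single attention head, a toy width `D = 4`, random weights, and hypothetical helper names (`CountingProjector`, `step_without_cache`, `step_with_cache`). It counts how many key and value projections are needed for the newest position with and without the stored vectors.

```python
import math
import random

D = 4  # toy model width (assumption for illustration)

def matvec(W, x):
    """Multiply a D x D matrix (list of rows) by a length-D vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]

class CountingProjector:
    """Hypothetical wrapper that counts how often a projection is applied."""
    def __init__(self, W):
        self.W = W
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return matvec(self.W, x)

def step_without_cache(xs, proj_q, proj_k, proj_v):
    # Recompute keys and values for every position on each step.
    keys = [proj_k(x) for x in xs]
    values = [proj_v(x) for x in xs]
    return attend(proj_q(xs[-1]), keys, values)

def step_with_cache(x_new, cache_k, cache_v, proj_q, proj_k, proj_v):
    # Project only the newest position and append it to the stored vectors.
    cache_k.append(proj_k(x_new))
    cache_v.append(proj_v(x_new))
    return attend(proj_q(x_new), cache_k, cache_v)

rng = random.Random(0)

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
# 100 earlier positions plus the newest one (random stand-ins for hidden states).
xs = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(101)]

def proj_q(x):
    return matvec(W_q, x)

# Path A: no cache, so all 101 keys and 101 values are recomputed.
k_a, v_a = CountingProjector(W_k), CountingProjector(W_v)
out_a = step_without_cache(xs, proj_q, k_a, v_a)
uncached_kv_calls = k_a.calls + v_a.calls  # 202

# Path B: keys/values for the first 100 positions were stored on earlier steps.
cache_k = [matvec(W_k, x) for x in xs[:100]]
cache_v = [matvec(W_v, x) for x in xs[:100]]
k_b, v_b = CountingProjector(W_k), CountingProjector(W_v)
out_b = step_with_cache(xs[100], cache_k, cache_v, proj_q, k_b, v_b)
cached_kv_calls = k_b.calls + v_b.calls  # 2

print("K/V projections without cache:", uncached_kv_calls)
print("K/V projections with cache:", cached_kv_calls)
print("same output:", all(math.isclose(a, b) for a, b in zip(out_a, out_b)))
```

In this sketch both paths produce the same attention output, and the query still takes a dot product with every stored key. The saving is that the key and value projections for the 100 earlier tokens are not recomputed, so only the newest token's key and value have to be calculated at this step.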

Learn Before

Key-Value Cache

Multiple Choice


Updated 2025-10-04
