Learn Before
Multiple Choice

An autoregressive model is generating a sequence of text. It has already produced 100 tokens and is now in the process of generating the 101st token. The model uses an optimization where the key and value vectors for all 100 previous tokens are stored and reused. What is the primary computational advantage of this storage mechanism when calculating the attention for the 101st token?

0

1

Updated 2025-10-04

Contributors are:

Who are from:

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science