Learn Before
Consider an autoregressive model generating a sequence of tokens one by one. At each step i, the model calculates attention using the query from the current token and the keys and values from all tokens generated so far (from position 1 to i). To optimize this process, the model maintains a growing set of all previously computed key and value vectors. What is the primary computational advantage of this strategy?
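The caching strategy described in the question can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a real model: `project_kv` is a hypothetical stand-in for the learned key/value projections, the token vector doubles as its own query, and the counter exists only to make the amount of projection work visible.

```python
import math

def project_kv(x, counter):
    """Hypothetical stand-in for the key/value projections of one token."""
    counter[0] += 1  # count how many tokens get projected
    key = [2.0 * xi for xi in x]
    value = [xi + 1.0 for xi in x]
    return key, value

def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def generate_with_cache(inputs):
    """At step i, project only the new token and append it to the cache."""
    counter = [0]
    keys, values, outputs = [], [], []
    for x in inputs:
        k, v = project_kv(x, counter)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))  # query = x for simplicity
    return outputs, counter[0]

def generate_without_cache(inputs):
    """At step i, recompute keys and values for every token from 1 to i."""
    counter = [0]
    outputs = []
    for i in range(len(inputs)):
        pairs = [project_kv(x, counter) for x in inputs[: i + 1]]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(inputs[i], keys, values))
    return outputs, counter[0]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cached_out, cached_calls = generate_with_cache(tokens)
plain_out, plain_calls = generate_without_cache(tokens)
```

For a sequence of n tokens, the cached loop projects each token once (n projections in total), while the uncached loop repeats the work for the whole prefix at every step (n(n+1)/2 projections), and both produce the same attention outputs.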
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Set of Indexed Key-Value Pairs
Set of Superscript-Indexed Vectors
Set of Key-Value Pairs
Function of a Sequence of Overlined Variables
Function of a Sequence of Averaged Vectors
Vector Slice Notation for a Sequence Window ()
Set of Sequential Vectors Notation
Vector Sequence Window Notation
State of an Autoregressive Cache
An autoregressive language model with τ parallel computational units (e.g., attention heads) is generating a sequence of tokens. After computing the output for the 3rd token, the model stores the key and value vectors from all tokens processed so far to use in subsequent steps. Which of the following notations correctly represents the complete set of these stored key-value pairs at this specific moment?
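To make the shape of that stored state concrete, the sketch below enumerates the cached pairs as labels. It is illustrative only: the function name `cache_state`, the choice of τ = 2, and the labeling convention (subscript for token position, superscript for head index) are assumptions and are not taken from the answer options.

```python
def cache_state(num_heads, num_tokens):
    """Label every stored key-value pair by head j and token position i."""
    return {
        j: [(f"k_{i}^({j})", f"v_{i}^({j})") for i in range(1, num_tokens + 1)]
        for j in range(1, num_heads + 1)
    }

# Assumed example: tau = 2 heads, state right after the 3rd token.
state = cache_state(num_heads=2, num_tokens=3)
for head, pairs in state.items():
    print(head, pairs)
```

Under these assumptions, the cache holds one key-value pair per head for each of positions 1 through 3, so τ × 3 pairs in total.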