Computational Scaling in Autoregressive Models
In an autoregressive language model, the probability of the next token is calculated based on the input and all previously generated tokens. Explain why the computational cost of calculating this probability for each new token can become a significant challenge as the generated sequence gets longer. What is the primary source of this increasing computational load?
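The effect asked about can be seen in a minimal sketch of single-head dot-product attention during decoding (toy sizes; `attention_step` and the rough FLOP count are illustrative assumptions, not a real model): even with keys and values cached, the new token's query must attend over every previous position, so the work per step grows with the sequence length.

```python
# Minimal sketch (assumption: single-head dot-product attention, toy sizes)
# of why per-token cost grows with sequence length in autoregressive decoding.
import numpy as np

d = 8  # hidden size (toy value)

def attention_step(q, K, V):
    """One decoding step: the new query attends over ALL cached keys/values.
    With t previous tokens, the dot products and weighted sum are O(t * d)."""
    scores = K @ q / np.sqrt(d)           # t dot products -> cost grows with t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over t positions
    return weights @ V                    # weighted sum over t cached values

rng = np.random.default_rng(0)
K = np.empty((0, d))
V = np.empty((0, d))
flops_per_step = []
for t in range(1, 6):                     # generate 5 tokens
    q = rng.normal(size=d)                # query for the newest token
    K = np.vstack([K, rng.normal(size=d)])   # cache grows by one row per step
    V = np.vstack([V, rng.normal(size=d)])
    attention_step(q, K, V)
    flops_per_step.append(2 * t * d)      # rough count: scores + weighted sum

print(flops_per_step)  # [16, 32, 48, 64, 80] -- each step costs more than the last
```

Summing that linearly growing per-step cost over a sequence of length n gives a total on the order of n^2, which is the primary source of the increasing load.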
Tags
Ch.5 Inference - Foundations of Large Language Models
Related
Language Model Design Trade-offs
When designing an autoregressive language model, a key decision is how to model the conditional probability of the next token given the context, Pr(y_i | x, y_{<i}). Consider two approaches:
- Approach 1: Uses a fixed-size window, considering only the k most recent previous tokens (y_{i-k}, ..., y_{i-1}) to predict the next token y_i.
- Approach 2: Processes the entire preceding sequence (y_{<i}) to predict the next token y_i.
Which statement best analyzes the fundamental trade-off between these two approaches regarding the modeling and efficient computation of this probability?
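The contrast between the two approaches can be sketched with stub context functions (hypothetical names; the model itself is omitted, only the context handling differs): Approach 1 conditions on a context of bounded size, so per-token cost stays constant but distant tokens are discarded, while Approach 2's context, and hence its per-token cost, grows with the position.

```python
# Hypothetical sketch of the two context strategies; `window_context` and
# `full_context` are illustrative stubs, not part of any real model API.
def window_context(tokens, i, k):
    """Approach 1: fixed-size window -- at most the k most recent tokens.
    Context size is bounded by k, so per-step cost is constant."""
    return tokens[max(0, i - k):i]

def full_context(tokens, i):
    """Approach 2: the entire preceding sequence y_{<i}.
    Context size grows with i, so per-step cost grows too."""
    return tokens[:i]

tokens = list("abcdefgh")
i = 6
print(window_context(tokens, i, k=3))  # ['d', 'e', 'f'] -- bounded, loses 'a'..'c'
print(full_context(tokens, i))         # ['a', 'b', 'c', 'd', 'e', 'f'] -- complete
```

The trade-off in one line: the window buys constant-time prediction at the price of ignoring dependencies older than k tokens, while the full sequence captures all dependencies at a cost that grows with the sequence length.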