Modeling and Efficient Computation of Conditional Token Probabilities
A crucial aspect of implementing autoregressive language models involves two interconnected tasks: first, defining a model for the conditional probability of the next token, Pr(yi|x, y_{<i}), and second, ensuring that this probability can be calculated in a computationally efficient way.
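As an illustration (not from the source text), the sketch below shows one common way to obtain Pr(y_i | x, y_{<i}): apply a softmax to the model's next-token logits and read off the entry for the candidate token. Here `model_logits`, the toy vocabulary size, and the random scoring are hypothetical stand-ins for a trained language model.

```python
# Minimal sketch: conditional next-token probability from a logit vector.
# `model_logits` is a hypothetical stand-in for a trained LM.
import numpy as np

VOCAB_SIZE = 8  # toy vocabulary for illustration

def model_logits(context_ids):
    """Hypothetical model: returns one logit per vocabulary token for a given context."""
    rng = np.random.default_rng(sum(context_ids))  # deterministic toy scores
    return rng.normal(size=VOCAB_SIZE)

def next_token_prob(context_ids, token_id):
    """Pr(y_i = token_id | x, y_{<i}), where context_ids holds x followed by y_{<i}."""
    logits = model_logits(context_ids)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return probs[token_id]

# Example: probability that token 3 follows the context [0, 5, 2]
print(next_token_prob([0, 5, 2], 3))
```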
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Mathematical Formulation of LLM Inference
Equivalence of Maximizing Auto-regressive Log-Likelihood and Minimizing Cross-Entropy Loss
Conditional vs. Joint Probability Objectives in Language Modeling
Notational Convention for Autoregressive Conditional Probability
Modeling and Efficient Computation of Conditional Token Probabilities
A language model is generating a response sequence 'y' given an input context 'x'. The model generates the two-token sequence y = ('deep', 'learning'). The model's calculated log-probabilities for each step of the generation are as follows:
- Log-probability of the first token: log Pr(y_1 = 'deep' | x) = -0.7
- Log-probability of the second token, given the first: log Pr(y_2 = 'learning' | x, y_1 = 'deep') = -0.4
Based on the standard method for calculating the probability of a full sequence, what is the total conditional log-likelihood of the entire sequence 'y', i.e., log Pr(y|x)?
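A minimal sketch of the chain-rule sum for this example; the -0.7 and -0.4 values come from the question above, and the code itself is purely illustrative.

```python
# log Pr(y | x) = log Pr(y_1 | x) + log Pr(y_2 | x, y_1)
step_log_probs = [-0.7, -0.4]        # per-token log-probabilities from the question
total_log_likelihood = sum(step_log_probs)
print(total_log_likelihood)          # -1.1, so Pr(y|x) = exp(-1.1) ≈ 0.33
```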
Comparing Model Confidence via Log-Likelihood
Analyzing a Flawed Log-Likelihood Calculation
Model-Specific Optimizations for LLM Inference
Modeling and Efficient Computation of Conditional Token Probabilities
Efficient Generation of Candidate Solutions via Search Algorithms
An AI research team is developing a new generative model for creating complex musical compositions. They find that while their model can accurately calculate the probability of any given short musical phrase, generating a full, high-quality, multi-minute symphony is computationally intractable because they cannot feasibly check every possible combination of notes to find the absolute best one. How does this team's challenge relate to the broader field of artificial intelligence?
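One common way to sidestep exhaustive enumeration is a heuristic search such as beam search, which keeps only a few of the highest-scoring partial sequences at each step instead of all possible continuations. The sketch below is illustrative only; `log_prob_next`, the toy note vocabulary, and the uniform scores are hypothetical placeholders for a real generative model.

```python
# Sketch of beam search: explore only the best `beam_width` prefixes per step.
import math

def log_prob_next(prefix):
    """Hypothetical scorer: returns {token: log-prob} for extending a prefix."""
    vocab = ["C", "D", "E", "G"]                     # toy "note" vocabulary
    return {t: math.log(1.0 / len(vocab)) for t in vocab}

def beam_search(beam_width=2, max_len=4):
    beams = [((), 0.0)]                              # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in log_prob_next(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        # Keep only the best beam_width partial sequences instead of all of them.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

print(beam_search())
```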
Comparing Computational Challenges in AI Tasks
Identifying Common Computational Structures in AI
Accuracy-Efficiency Trade-off in LLM Inference
Learn After
Language Model Design Trade-offs
When designing an autoregressive language model, a key decision is how to model the conditional probability of the next token given the context, Pr(y_i | x, y_{<i}). Consider two approaches:
- Approach 1: Uses a fixed-size window, considering only the k most recent previous tokens (y_{i-k}, ..., y_{i-1}) to predict the next token y_i.
- Approach 2: Processes the entire preceding sequence (y_{<i}) to predict the next token y_i.
Which statement best analyzes the fundamental trade-off between these two approaches regarding the modeling and efficient computation of this probability?
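To make the trade-off concrete, the sketch below (hypothetical, not tied to any particular architecture) contrasts how much context each approach feeds to the model at step i: a fixed window keeps the per-step input size roughly constant but discards long-range dependencies, while the full prefix preserves them at a cost that grows with sequence length.

```python
# Illustrative contrast between the two approaches: what context reaches the model.
def window_context(x, y_prev, k):
    """Approach 1: condition only on the k most recent tokens of (x + y_prev)."""
    full = list(x) + list(y_prev)
    return full[-k:]                  # fixed-size input -> roughly constant cost per step

def full_context(x, y_prev):
    """Approach 2: condition on the entire preceding sequence."""
    return list(x) + list(y_prev)     # input grows with i -> per-step cost grows too

x = ["the", "model", "reads", "a", "long", "prompt"]
y_prev = ["and", "then", "generates"]
print(window_context(x, y_prev, k=4))  # last 4 tokens only -> limited dependencies
print(full_context(x, y_prev))         # all 9 tokens -> long-range dependencies, higher cost
```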
Computational Scaling in Autoregressive Models