Notational Convention for Autoregressive Conditional Probability
In autoregressive models, the notation for the conditional probability of a token, Pr(y_i|x, y_{<i}), is a common shorthand. It signifies the probability of token y_i conditioned on the single sequence formed by concatenating the input x with the preceding output tokens y_{<i}. A more explicit, but less frequently used, notation for this is Pr(y_i|[x, y_{<i}]), where [x, y_{<i}] represents the full context for the prediction.
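The equivalence of the two notations can be sketched in code: a model's next-token distribution is a function of one context sequence, so Pr(y_i|x, y_{<i}) and Pr(y_i|[x, y_{<i}]) denote the same quantity. The toy uniform model below is an assumption for illustration, not a real language model.

```python
# Minimal sketch: the two-argument shorthand Pr(y_i | x, y_{<i}) resolves
# to a single-sequence conditional Pr(y_i | [x, y_{<i}]).
# The toy model (uniform over a tiny vocabulary, ignoring context) is an
# assumption purely for illustration.

VOCAB = ["deep", "learning", "about", "a"]

def pr_next(token, context):
    """Toy Pr(token | context): uniform over VOCAB, ignoring context."""
    return 1.0 / len(VOCAB) if token in VOCAB else 0.0

def pr_shorthand(token, x, y_prefix):
    """The common shorthand Pr(y_i | x, y_{<i}) with two context arguments."""
    # Internally there is only ONE context: the concatenation [x, y_{<i}].
    return pr_next(token, list(x) + list(y_prefix))

# Both notations give the same value for the same prediction:
x, y_prefix = ["hello"], ["deep"]
assert pr_shorthand("learning", x, y_prefix) == pr_next("learning", x + y_prefix)
print("notations agree")
```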
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Mathematical Formulation of LLM Inference
Equivalence of Maximizing Auto-regressive Log-Likelihood and Minimizing Cross-Entropy Loss
Conditional vs. Joint Probability Objectives in Language Modeling
Notational Convention for Autoregressive Conditional Probability
Modeling and Efficient Computation of Conditional Token Probabilities
A language model is generating a response sequence 'y' given an input context 'x'. The model generates the two-token sequence y = ('deep', 'learning'). The model's calculated log-probabilities for each step of the generation are as follows:
- Log-probability of the first token: log Pr(y₁='deep' | x) = -0.7
- Log-probability of the second token, given the first: log Pr(y₂='learning' | x, y₁='deep') = -0.4

Based on the standard method for calculating the probability of a full sequence, what is the total conditional log-likelihood of the entire sequence 'y', i.e., log Pr(y|x)?
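A quick sanity check of the arithmetic: by the chain rule, log Pr(y|x) is the sum of the per-step conditional log-probabilities. The values below are the ones given in the question.

```python
import math

# Chain rule for autoregressive models:
#   log Pr(y|x) = log Pr(y1|x) + log Pr(y2|x, y1)
step_log_probs = [
    -0.7,  # log Pr(y1='deep' | x)
    -0.4,  # log Pr(y2='learning' | x, y1='deep')
]

log_likelihood = sum(step_log_probs)
print(f"{log_likelihood:.1f}")  # -1.1

# Equivalently, the sequence probability is the product of step probabilities:
assert math.isclose(math.exp(log_likelihood),
                    math.exp(-0.7) * math.exp(-0.4))
```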
Comparing Model Confidence via Log-Likelihood
Analyzing a Flawed Log-Likelihood Calculation
Learn After
Probability Normalization over a Candidate Set
An autoregressive model is given an input prompt, x, which is the sequence 'The best movie I ever saw was'. The model has already generated the partial output sequence, y_{<i}, which is 'about a'. The model's next task is to predict the probability of the next token, y_i, based on the standard conditional probability notation Pr(y_i|x, y_{<i}). What is the actual, full sequence of tokens the model uses as its context to make this prediction?

In the context of autoregressive sequence generation, the notation Pr(y_i|x, y_{<i}) implies that the model treats the input x and the previously generated tokens y_{<i} as two separate, distinct sources of information for predicting the next token y_i.
Interpreting Autoregressive Model Inputs
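For the movie-prompt question above, the context the model actually conditions on can be built explicitly; splitting on whitespace is an assumed stand-in for real tokenization.

```python
# Sketch: the model's context is the single concatenation of the prompt x
# and the partial output y_{<i}. Word-level splitting is an assumption
# standing in for a real tokenizer.

x = "The best movie I ever saw was".split()
y_prefix = "about a".split()

full_context = x + y_prefix  # [x, y_{<i}]: the one sequence conditioned on
print(" ".join(full_context))  # The best movie I ever saw was about a
```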