Learn Before
Derivation of Sequence Log-Probability via Chain Rule
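The log-probability of a sequence is derived by taking the logarithm of the chain-rule factorization of its joint probability. Because the logarithm of a product equals the sum of the logarithms, this step turns the product of conditional probabilities into a sum, which is more numerically stable to compute. The derivation proceeds as follows:
$$\log \Pr(x_1, \dots, x_n) = \log \prod_{i=1}^{n} \Pr(x_i \mid x_1, \dots, x_{i-1}) = \sum_{i=1}^{n} \log \Pr(x_i \mid x_1, \dots, x_{i-1})$$
For $i = 1$ the conditioning context is empty, so the first term is simply $\log \Pr(x_1)$.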
This decomposition is a foundational step for formulating the log-likelihood objective in language models.
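The product-to-sum step described above can be illustrated with a minimal Python sketch. The function names `sequence_prob` and `sequence_log_prob` and all probability values are hypothetical, chosen only for illustration; the sketch assumes every conditional probability is strictly positive, since `math.log(0)` raises an error.

```python
import math

def sequence_prob(cond_probs):
    """Joint probability as the direct product of conditional probabilities."""
    return math.prod(cond_probs)

def sequence_log_prob(cond_probs):
    """Joint log-probability as the sum of log conditional probabilities."""
    return sum(math.log(p) for p in cond_probs)

# Hypothetical short sequence: both forms agree after exponentiating the sum.
short = [0.2, 0.5, 0.4]
print(math.isclose(math.exp(sequence_log_prob(short)), sequence_prob(short)))

# Hypothetical long sequence: 200 tokens, each with conditional probability 0.01.
# The true product is 1e-400, which is below the smallest positive double,
# so the direct product underflows to 0.0 while the log-sum stays finite.
long_seq = [0.01] * 200
print(sequence_prob(long_seq))
print(sequence_log_prob(long_seq))
```

For the long sequence the direct product is no longer usable for comparing candidates, whereas the summed log-probability (roughly -921) still is, which is the practical motivation for working in log space.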

References
Reference of Foundations of Large Language Models Course
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Base Case for Sequence Probability
Joint Probability of a Generated Sequence using the Chain Rule
Relationship Between Joint, Conditional, and Marginal Log-Probabilities of Sequences
Derivation of Sequence Log-Probability via Chain Rule
Logarithmic Form of the Chain Rule for Sequence Probability
Formula for an Impossible Initial Event
A language model is tasked with calculating the total probability of the three-token sequence 'the cat sat'. The model provides the following probability estimates:
- The probability of the first token is Pr("the") = 0.1
- The probability of the second token, given the first, is Pr("cat" | "the") = 0.5
- The probability of the third token, given the first two, is Pr("sat" | "the", "cat") = 0.8
Using the principle that the joint probability of a sequence is the product of the conditional probabilities of its components, what is the joint probability Pr("the", "cat", "sat")?
Computational Stability of Sequence Probability
Which of the following expressions correctly decomposes the joint probability of a four-token sequence (x₁, x₂, x₃, x₄) using the chain rule of probability?
Learn After
Log-Likelihood of a Sequence
When calculating the probability of a long sequence of words, the standard approach involves multiplying many conditional probabilities, each of which is a value between 0 and 1. This product is often converted into a sum by applying the logarithm to each term. What is the primary computational reason for this transformation?
A language model calculates the probability of a sequence of tokens, $x_1, \dots, x_n$, using the product of conditional probabilities: $\Pr(x_1, \dots, x_n) = \prod_{i=1}^{n} \Pr(x_i \mid x_1, \dots, x_{i-1})$. To improve numerical stability and simplify calculations, this product is converted into a sum by taking the logarithm. Which of the following expressions correctly represents the log-probability, $\log \Pr(x_1, \dots, x_n)$?
Calculating Sequence Log-Probability