Learn Before
Logarithmic Form of the Chain Rule for Sequence Probability
The chain rule for calculating the joint probability of a sequence can be expressed in an alternative logarithmic form. This is achieved by taking the logarithm of the entire probability expression, which transforms the product of conditional probabilities into a sum. This summation form is computationally more stable, especially for long sequences, as it mitigates the risk of numerical underflow from multiplying many small fractions. The formula is: log Pr(x₀, x₁, …, xₘ) = Σᵢ₌₀ᵐ log Pr(xᵢ | x₀, …, xᵢ₋₁). In this formulation, it is assumed that for the initial token, where i = 0, the probability is Pr(x₀) = 1, so its log term is 0. As a consequence of this assumption, the overall probability of the sequence simplifies to log Pr(x₀, x₁, …, xₘ) = Σᵢ₌₁ᵐ log Pr(xᵢ | x₀, …, xᵢ₋₁).
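A minimal sketch (Python, with hypothetical conditional probabilities) showing that the log-sum form encodes the same quantity as the direct product:

```python
import math

# Hypothetical per-token conditional probabilities Pr(x_i | x_0, ..., x_{i-1}).
cond_probs = [0.1, 0.5, 0.8]

# Product form: Pr(x_0, ..., x_m) = product of the conditionals.
prob = math.prod(cond_probs)

# Logarithmic form: log Pr(x_0, ..., x_m) = sum of the log conditionals.
log_prob = sum(math.log(p) for p in cond_probs)

# Exponentiating the log-sum recovers the product.
assert math.isclose(prob, math.exp(log_prob))
```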

References
Reference of Foundations of Large Language Models Course
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Base Case for Sequence Probability
Joint Probability of a Generated Sequence using the Chain Rule
Relationship Between Joint, Conditional, and Marginal Log-Probabilities of Sequences
Derivation of Sequence Log-Probability via Chain Rule
Logarithmic Form of the Chain Rule for Sequence Probability
Formula for an Impossible Initial Event
A language model is tasked with calculating the total probability of the three-token sequence 'the cat sat'. The model provides the following probability estimates:
- The probability of the first token is Pr("the") = 0.1
- The probability of the second token, given the first, is Pr("cat" | "the") = 0.5
- The probability of the third token, given the first two, is Pr("sat" | "the", "cat") = 0.8
Using the principle that the joint probability of a sequence is the product of the conditional probabilities of its components, what is the joint probability Pr("the", "cat", "sat")?
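A quick check of the product rule with the question's numbers:

```python
p_the = 0.1                # Pr("the")
p_cat_given_the = 0.5      # Pr("cat" | "the")
p_sat_given_both = 0.8     # Pr("sat" | "the", "cat")

# Chain rule: the joint probability is the product of the conditionals.
joint = p_the * p_cat_given_the * p_sat_given_both
print(joint)  # ≈ 0.04
```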
Computational Stability of Sequence Probability
Which of the following expressions correctly decomposes the joint probability of a four-token sequence (x₁, x₂, x₃, x₄) using the chain rule of probability?
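The decomposition asked about here, Pr(x₁, x₂, x₃, x₄) = Pr(x₁) · Pr(x₂ | x₁) · Pr(x₃ | x₁, x₂) · Pr(x₄ | x₁, x₂, x₃), can be verified numerically on a toy joint distribution (hypothetical weights, not from the course):

```python
from itertools import product

# A toy joint distribution over four binary tokens: arbitrary positive
# weights, normalized so the probabilities sum to 1.
weights = {seq: i + 1 for i, seq in enumerate(product([0, 1], repeat=4))}
total = sum(weights.values())
joint = {seq: w / total for seq, w in weights.items()}

def marginal(prefix):
    """Pr(x_1, ..., x_k) for a prefix, summing the joint over the rest."""
    return sum(p for seq, p in joint.items() if seq[:len(prefix)] == prefix)

# Chain rule: multiply Pr(x_k | x_1, ..., x_{k-1}) = marginal(k) / marginal(k-1)
# for k = 1..4; the marginals telescope back to the joint probability.
seq = (0, 1, 1, 0)
chain = 1.0
for k in range(1, 5):
    chain *= marginal(seq[:k]) / marginal(seq[:k - 1])

assert abs(chain - joint[seq]) < 1e-12
```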
Learn After
A language model is tasked with calculating the joint probability of a very long sequence of words, such as an entire book chapter. The model computes the conditional probability for each word given its preceding context. When the model attempts to find the total probability of the chapter by multiplying these thousands of individual conditional probabilities (which are all fractions less than 1), which computational issue is most likely to occur, and why is converting the calculation to a sum of logarithms the standard solution?
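A sketch of the underflow failure mode this question describes, assuming a long run of small conditional probabilities (0.001 each, a made-up value):

```python
import math

# Thousands of conditional probabilities, each a fraction less than 1.
cond_probs = [0.001] * 2000

# Direct multiplication underflows: 0.001**2000 = 1e-6000 is far below
# the smallest positive double (~5e-324), so the product collapses to 0.0.
prob = 1.0
for p in cond_probs:
    prob *= p
print(prob)  # 0.0 — all information about the sequence is lost

# Summing logarithms stays in a comfortable numeric range instead.
log_prob = sum(math.log(p) for p in cond_probs)
print(log_prob)  # ≈ -13815.5 (natural log), a perfectly representable value
```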
Calculating Sequence Log Probability
A language model calculates the total log probability for two different sequences of words. The total log probability for Sequence A is -8.7, and the total log probability for Sequence B is -10.2. Based solely on these values, what can be concluded about the likelihood of these two sequences?
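Because the logarithm is monotonically increasing, comparing log probabilities is equivalent to comparing the probabilities themselves; a sketch using the question's values (assuming natural logarithms):

```python
import math

log_p_a = -8.7   # total log probability of Sequence A
log_p_b = -10.2  # total log probability of Sequence B

# The larger (less negative) log probability is the more likely sequence.
more_likely = "A" if log_p_a > log_p_b else "B"
print(more_likely)  # A

# The probability ratio is exp of the difference of the logs.
ratio = math.exp(log_p_a - log_p_b)
print(round(ratio, 2))  # ≈ 4.48: Sequence A is about 4.5x as likely as B
```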