1Cademy - Chain Rule for Sequence Probability

Learn Before

Language Models (LMs)
Autoregressive Conditional Probability

Chain Rule for Sequence Probability

The chain rule of probability is a fundamental principle used in language modeling to calculate the joint probability of a sequence of tokens, such as $\{x_0, x_1, ..., x_m\}$. It works by decomposing the joint probability into a product of conditional probabilities. Each term in this product, $\text{Pr}(x_i|x_0, ..., x_{i-1})$, represents the probability of token $x_i$ occurring, given all the tokens that came before it.

The general formula is:

$\text{Pr}(x_0, ..., x_m) = \text{Pr}(x_0) \cdot \text{Pr}(x_1|x_0) \cdot \text{Pr}(x_2|x_0, x_1) \cdots \text{Pr}(x_m|x_0, ..., x_{m-1})$

This can be written more compactly using product notation:

$\text{Pr}(x_0, ..., x_m) = \prod_{i=0}^{m} \text{Pr}(x_i|x_0, ..., x_{i-1})$

By convention, the conditional probability for the first term, where $i = 0$, has an empty context and is simply the marginal probability $\text{Pr}(x_0)$.

For practical applications, this formula is usually expressed in logarithmic form, which converts the product into a sum:

$\log \text{Pr}(x_0, ..., x_m) = \sum_{i=0}^{m} \log \text{Pr}(x_i|x_0, ..., x_{i-1})$

Summing log-probabilities improves numerical stability, because multiplying many probabilities smaller than 1 can underflow to zero in floating-point arithmetic for long sequences.
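The product and logarithmic forms described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `sequence_probability` and `sequence_log_probability` are hypothetical, and the input is assumed to be a list of already-computed conditional probabilities, one per token, in order. The example values are the 'the cat sat' estimates listed under Learn After on this page.

```python
import math
from typing import Sequence


def sequence_probability(conditionals: Sequence[float]) -> float:
    """Multiply the per-token conditionals Pr(x_i | x_0, ..., x_{i-1})."""
    prob = 1.0
    for p in conditionals:
        prob *= p
    return prob


def sequence_log_probability(conditionals: Sequence[float]) -> float:
    """Sum the per-token log-probabilities (natural log).

    Returns -inf if any conditional is zero, since the joint
    probability is then zero as well.
    """
    total = 0.0
    for p in conditionals:
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


# Pr("the"), Pr("cat" | "the"), Pr("sat" | "the", "cat")
conditionals = [0.1, 0.5, 0.8]

joint = sequence_probability(conditionals)
log_joint = sequence_log_probability(conditionals)

print(joint)                # direct product of the three terms
print(log_joint)            # sum of the three log terms
print(math.exp(log_joint))  # exponentiating recovers the product (up to rounding)
```

The two functions agree up to floating-point rounding on short sequences. The difference shows up on long ones: for example, with 100 tokens each at probability 1e-5, the direct product underflows to 0.0 in double precision, while the log-probability sum stays finite.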



Learn After

Base Case for Sequence Probability
Joint Probability of a Generated Sequence using the Chain Rule
Relationship Between Joint, Conditional, and Marginal Log-Probabilities of Sequences
Derivation of Sequence Log-Probability via Chain Rule
Logarithmic Form of the Chain Rule for Sequence Probability
Formula for an Impossible Initial Event
A language model is tasked with calculating the total probability of the three-token sequence 'the cat sat'. The model provides the following probability estimates:
- The probability of the first token is Pr("the") = 0.1
- The probability of the second token, given the first, is Pr("cat" | "the") = 0.5
- The probability of the third token, given the first two, is Pr("sat" | "the", "cat") = 0.8
Using the principle that the joint probability of a sequence is the product of the conditional probabilities of its components, what is the joint probability Pr("the", "cat", "sat")?
Computational Stability of Sequence Probability
Which of the following expressions correctly decomposes the joint probability of a four-token sequence (x₁, x₂, x₃, x₄) using the chain rule of probability?
