Learn Before
Implication of an Impossible Initial Event
If the initial event in a sequence is impossible, as denoted by Pr(x₀) = 0, then the joint probability of any sequence starting with x₀ is also zero. This is a direct consequence of the chain rule of probability, where the joint probability is calculated as a product of terms, one of which is Pr(x₀) = 0.
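The chain rule computation above can be sketched in Python; the function name and the sample probability values are illustrative assumptions, not part of the original card:

```python
def joint_probability(p_x0, p_x1_given_x0, p_x2_given_x0x1):
    """Chain rule for a three-token sequence:
    Pr(x0, x1, x2) = Pr(x0) * Pr(x1|x0) * Pr(x2|x0, x1)."""
    return p_x0 * p_x1_given_x0 * p_x2_given_x0x1

# If the initial token is impossible, Pr(x0) = 0, and the product is zero
# no matter how large the conditional probabilities that follow are.
print(joint_probability(0.0, 0.9, 0.95))  # 0.0
```

Because the joint probability is a product, a single zero factor makes the whole product zero.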

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Implication of an Impossible Initial Event
A language model calculates the probability of a sequence of three tokens, {x₀, x₁, x₂}, using the formula: Pr(x₀, x₁, x₂) = Pr(x₀) * Pr(x₁|x₀) * Pr(x₂|x₀, x₁). If the model determines that the initial token, x₀, is an impossible event, what is the joint probability of the entire sequence?
Consequence of an Impossible Starting Token
A language model is calculating the probability of the sequence 'Zxq#w the cat sat'. If the model's vocabulary does not contain the token 'Zxq#w', making its initial probability zero, can the model still assign a non-zero probability to the entire sequence by considering the high probabilities of the subsequent words 'the', 'cat', and 'sat'?
Learn After
A language model calculates the joint probability of a sequence of events (e.g., words) by multiplying the probability of the first event by the conditional probabilities of each subsequent event. Given the following probabilities for a three-event sequence (x₀, x₁, x₂), what is the joint probability of the entire sequence?
- Probability of the first event, Pr(x₀) = 0.0
- Probability of the second event given the first, Pr(x₁|x₀) = 0.4
- Probability of the third event given the first two, Pr(x₂|x₀, x₁) = 0.8
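The numbers above can be plugged directly into the chain rule; this short sketch just carries out that arithmetic:

```python
# Probabilities given in the question.
p_x0 = 0.0            # Pr(x0): the first event is impossible
p_x1_given_x0 = 0.4   # Pr(x1|x0)
p_x2_given_x01 = 0.8  # Pr(x2|x0, x1)

# Chain rule: Pr(x0, x1, x2) = Pr(x0) * Pr(x1|x0) * Pr(x2|x0, x1)
joint = p_x0 * p_x1_given_x0 * p_x2_given_x01
print(joint)  # 0.0 -- the zero initial factor forces the product to zero
```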
Language Model Debugging Scenario
A language model is generating a sequence of words. The first word has a probability of 0, but the conditional probabilities for all subsequent words in the sequence are very high (e.g., 0.99 for each). In this scenario, can the high probabilities of the later words overcome the initial zero probability and yield a non-zero joint probability for the entire sequence?
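A quick numerical check of this scenario (the sequence length and the 0.99 values are illustrative assumptions) shows that a zero factor cannot be overcome by later high-probability factors:

```python
# First word impossible (probability 0), ten subsequent words each at 0.99.
probs = [0.0] + [0.99] * 10

# The joint probability is the product of all factors.
joint = 1.0
for p in probs:
    joint *= p

print(joint)  # 0.0 -- multiplication by zero absorbs every later factor
```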
Explaining Zero Probability Sequences