Learn Before
Base Case for Sequence Probability
In the chain rule for sequence probability, the base case is the first token, x_0. Since there are no preceding tokens, its probability is its marginal probability, Pr(x_0). In many language models, this initial token is a deterministic start-of-sequence symbol, meaning its probability is fixed at 1, i.e., Pr(x_0) = 1. This assumption simplifies the joint probability calculation for the rest of the sequence. Specifically, the probability of the sequence following the initial token, Pr(x_1, ..., x_m | x_0), is unaffected when multiplied by Pr(x_0), as shown by the equation: Pr(x_0, x_1, ..., x_m) = Pr(x_0) · Pr(x_1, ..., x_m | x_0) = Pr(x_1, ..., x_m | x_0). This effectively means the calculation can start from the second token, conditioned on the first.
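A minimal sketch of this simplification (the token probabilities below are illustrative assumptions, not values from the text): because the start symbol contributes a factor of exactly 1, the joint probability is unchanged whether or not it is included in the product.

```python
import math

# Illustrative conditional probabilities Pr(x_t | x_0, ..., x_{t-1}).
# The start-of-sequence symbol is deterministic, so Pr(x_0) = 1.
conditionals = [1.0, 0.1, 0.5, 0.8]  # Pr(x_0), Pr(x_1|x_0), Pr(x_2|x_0,x_1), ...

# Joint probability via the chain rule: the product of all factors.
joint = math.prod(conditionals)

# Dropping the deterministic first factor gives the same result,
# so the calculation can start from the second token.
joint_from_second = math.prod(conditionals[1:])

assert joint == joint_from_second
```

Multiplying by 1.0 is exact in floating point, so the two products are identical bit-for-bit, not merely close.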

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Base Case for Sequence Probability
Joint Probability of a Generated Sequence using the Chain Rule
Relationship Between Joint, Conditional, and Marginal Log-Probabilities of Sequences
Derivation of Sequence Log-Probability via Chain Rule
Logarithmic Form of the Chain Rule for Sequence Probability
Formula for an Impossible Initial Event
A language model is tasked with calculating the total probability of the three-token sequence 'the cat sat'. The model provides the following probability estimates:
- The probability of the first token is Pr("the") = 0.1
- The probability of the second token, given the first, is Pr("cat" | "the") = 0.5
- The probability of the third token, given the first two, is Pr("sat" | "the", "cat") = 0.8
Using the principle that the joint probability of a sequence is the product of the conditional probabilities of its components, what is the joint probability Pr("the", "cat", "sat")?
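A quick sketch of the chain-rule arithmetic this question calls for, using the estimates given above:

```python
# Chain rule: Pr("the","cat","sat")
#   = Pr("the") * Pr("cat" | "the") * Pr("sat" | "the","cat")
p_the = 0.1
p_cat_given_the = 0.5
p_sat_given_the_cat = 0.8

joint = p_the * p_cat_given_the * p_sat_given_the_cat
print(round(joint, 3))  # 0.04
```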
Computational Stability of Sequence Probability
Which of the following expressions correctly decomposes the joint probability of a four-token sequence (x₁, x₂, x₃, x₄) using the chain rule of probability?
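For reference, the standard left-to-right chain-rule decomposition of a four-token joint probability is:

```latex
\Pr(x_1, x_2, x_3, x_4) = \Pr(x_1)\,\Pr(x_2 \mid x_1)\,\Pr(x_3 \mid x_1, x_2)\,\Pr(x_4 \mid x_1, x_2, x_3)
```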
Learn After
Log-Likelihood Objective for Language Model Training
A language model calculates the joint probability of a sequence of tokens (x_0, x_1, ..., x_m). The first token, x_0, is a special, deterministic start-of-sequence symbol. How does the nature of this specific first token typically affect the overall calculation of the sequence's joint probability?
Calculating Sequence Probability with a Start Token
Analyzing a Language Model's Sequence Probability