Chain Rule for Sequence Probability
The chain rule of probability is a core mathematical concept used in language modeling to determine the joint probability of a sequence of tokens, denoted as Pr(x₀, x₁, …, xₘ). This rule breaks the overall sequence probability down into a product of individual conditional probabilities. Specifically, the probability of the entire sequence is calculated as:

Pr(x₀, x₁, …, xₘ) = Pr(x₀) · Pr(x₁ | x₀) · Pr(x₂ | x₀, x₁) ⋯ Pr(xₘ | x₀, …, xₘ₋₁)

Using the compact product notation, this equates to:

Pr(x₀, x₁, …, xₘ) = Pr(x₀) ∏ᵢ₌₁ᵐ Pr(xᵢ | x₀, …, xᵢ₋₁)

Furthermore, to enhance computational efficiency and numerical stability, this product is frequently transformed into a sum of logarithms:

log Pr(x₀, x₁, …, xₘ) = log Pr(x₀) + ∑ᵢ₌₁ᵐ log Pr(xᵢ | x₀, …, xᵢ₋₁)
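The product form and the logarithmic form above can be sketched in a few lines of Python. The conditional probability values here are made up purely for illustration; a real language model would produce them per token.

```python
import math

# Hypothetical per-step conditional probabilities Pr(xᵢ | x₀ … xᵢ₋₁)
# for an illustrative four-token sequence (values are invented).
conditionals = [0.2, 0.3, 0.7, 0.5]

# Product form of the chain rule: multiply the conditionals together.
joint = 1.0
for p in conditionals:
    joint *= p

# Logarithmic form: sum the log-probabilities instead.
# Summing logs avoids floating-point underflow on long sequences.
log_joint = sum(math.log(p) for p in conditionals)

# Exponentiating the log-sum recovers the same joint probability.
assert math.isclose(joint, math.exp(log_joint))
print(joint)  # ≈ 0.021
```

The log-sum variant is what most implementations actually compute, since a product of many probabilities below 1 quickly underflows to zero in floating point.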

References
Reference of Foundations of Large Language Models Course
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Types of Language Models
Evaluating language models
Shannon's Foundational Work on Language Modeling
Generalization of the Language Modeling Concept
Chain Rule for Sequence Probability
Deep Learning Approach to Language Modeling
Output Token Sequence in LLMs
Start of Sentence (SOS) Token
[CLS] Token as a Start Symbol
A system is designed to predict the probability of a sequence of words. For the sequence 'The dog ran', the system provides the following conditional probabilities:
- The probability of 'The' occurring at the start of a sequence is 0.2.
- The probability of 'dog' occurring after 'The' is 0.3.
- The probability of 'ran' occurring after 'The dog' is 0.7.
Based on the fundamental principle used by such systems to determine the likelihood of a full sequence, what is the overall probability of the sequence 'The dog ran'?
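The calculation the question asks for is a direct application of the chain rule: multiply the three conditional probabilities given on the card. A minimal sketch, using only the values stated above:

```python
# Chain rule for 'The dog ran':
# Pr(The, dog, ran) = Pr(The) * Pr(dog | The) * Pr(ran | The dog)
p_the = 0.2              # Pr('The' at the start of a sequence)
p_dog_given_the = 0.3    # Pr('dog' | 'The')
p_ran_given_prefix = 0.7  # Pr('ran' | 'The dog')

p_sequence = p_the * p_dog_given_the * p_ran_given_prefix
print(p_sequence)  # ≈ 0.042
```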
Analyzing Language Model Probability Assignments
A system's primary goal is to predict the probability of a sequence of tokens. To calculate the total probability for the sequence 'The quick brown fox', it breaks the problem down into a series of conditional probability calculations. Arrange the following calculations in the correct order that the system would use to find the total probability of the sequence.
Evaluating a Language Model's Probabilistic Output
Chain Rule for Sequence Probability
Conditional Probability of the Next Token
A model is generating a sequence of words. It has already produced the words 'The', 'quick', 'brown'. According to the principle of autoregressive conditional probability, which expression correctly represents the likelihood that the next word will be 'fox', given the preceding words?
Defining Probability for a Token in a Sequence
A model is generating a sequence of elements (x₀, x₁, x₂, x₃, ...). To calculate the probability of the fourth element (x₃), the model's calculation must be conditioned on the entire preceding subsequence (x₀, x₁, x₂). A simplified model that conditions the probability of x₃ only on the immediately preceding element (x₂) would still be correctly applying the principle of autoregressive conditional probability.
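The distinction in this card can be made concrete in code. Both lookup tables below are hypothetical: exact autoregressive conditioning evaluates Pr(x₃ | x₀, x₁, x₂) on the full prefix, while the simplified model evaluates only Pr(x₃ | x₂), a first-order Markov approximation rather than the chain rule itself.

```python
# Invented conditional tables for a toy sequence over symbols 'a', 'b', 'c'.
# Exact autoregressive form: condition on the entire preceding subsequence.
full_history = {("a", "b", "c"): 0.9}  # Pr(x₃ | x₀='a', x₁='b', x₂='c')
# Markov simplification: condition only on the immediately preceding element.
markov = {"c": 0.5}                    # Pr(x₃ | x₂='c')

history = ("a", "b", "c")
p_exact = full_history[history]     # uses the full prefix (x₀, x₁, x₂)
p_approx = markov[history[-1]]      # discards x₀ and x₁

# The two values generally differ: dropping earlier context changes the
# distribution, so the Markov model is an approximation, not the chain rule.
print(p_exact, p_approx)
```

Truncating the history can be a useful modeling assumption (as in n-gram models), but it no longer implements the exact conditional probabilities the chain rule requires.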