Learn Before
  • Language Models (LMs)

  • Autoregressive Conditional Probability

Chain Rule for Sequence Probability

The chain rule of probability is a fundamental principle used in language modeling to compute the joint probability of a sequence of tokens, such as {x_0, x_1, ..., x_m}. It works by decomposing the joint probability into a product of conditional probabilities. Each factor in this product, \text{Pr}(x_i|x_0, ..., x_{i-1}), is the probability of token x_i occurring, given all the tokens that came before it. The general formula is:

\text{Pr}(x_0, ..., x_m) = \text{Pr}(x_0) \cdot \text{Pr}(x_1|x_0) \cdot \text{Pr}(x_2|x_0, x_1) \cdots \text{Pr}(x_m|x_0, ..., x_{m-1})

This can be written more compactly using product notation:

\text{Pr}(x_0, ..., x_m) = \prod_{i=0}^{m} \text{Pr}(x_i|x_0, ..., x_{i-1})

By convention, the conditional term for i = 0 is simply the marginal probability \text{Pr}(x_0). For practical applications, this formula is often expressed in logarithmic form, which converts the product into a sum and improves numerical stability, since multiplying many small probabilities quickly underflows floating-point arithmetic.
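As a minimal sketch, the chain rule and its logarithmic form can be computed directly from a list of per-token conditional probabilities. The values below are the illustrative estimates for the sequence "the cat sat" used elsewhere on this page (Pr("the") = 0.1, Pr("cat" | "the") = 0.5, Pr("sat" | "the", "cat") = 0.8); they are toy numbers, not the output of a real model.

```python
import math

# Conditional probabilities for "the cat sat" (toy values):
# Pr(x0), Pr(x1 | x0), Pr(x2 | x0, x1)
cond_probs = [0.1, 0.5, 0.8]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# Logarithmic form: the product becomes a sum of log-probabilities,
# which avoids underflow when the sequence is long.
log_joint = sum(math.log(p) for p in cond_probs)

# Exponentiating the log-sum recovers the same joint probability.
print(joint, math.exp(log_joint))
```

For a three-token sequence the two forms are numerically indistinguishable, but for sequences of hundreds of tokens the log-sum is the only stable choice.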

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Types of Language Models

  • Evaluating language models

  • Shannon's Foundational Work on Language Modeling

  • Generalization of the Language Modeling Concept

  • Chain Rule for Sequence Probability

  • Deep Learning Approach to Language Modeling

  • Output Token Sequence in LLMs

  • Start of Sentence (SOS) Token

  • [CLS] Token as a Start Symbol

  • A system is designed to predict the probability of a sequence of words. For the sequence 'The dog ran', the system provides the following conditional probabilities:

    • The probability of 'The' occurring at the start of a sequence is 0.2.
    • The probability of 'dog' occurring after 'The' is 0.3.
    • The probability of 'ran' occurring after 'The dog' is 0.7.

    Based on the fundamental principle used by such systems to determine the likelihood of a full sequence, what is the overall probability of the sequence 'The dog ran'?

  • Analyzing Language Model Probability Assignments

  • A system's primary goal is to predict the probability of a sequence of tokens. To calculate the total probability for the sequence 'The quick brown fox', it breaks the problem down into a series of conditional probability calculations. Arrange the following calculations in the correct order that the system would use to find the total probability of the sequence.

  • Evaluating a Language Model's Probabilistic Output

  • Chain Rule for Sequence Probability

  • Conditional Probability of the Next Token

  • A model is generating a sequence of words. It has already produced the words 'The', 'quick', 'brown'. According to the principle of autoregressive conditional probability, which expression correctly represents the likelihood that the next word will be 'fox', given the preceding words?

  • Defining Probability for a Token in a Sequence

  • A model is generating a sequence of elements (x₀, x₁, x₂, x₃, ...). To calculate the probability of the fourth element (x₃), the model's calculation must be conditioned on the entire preceding subsequence (x₀, x₁, x₂). A simplified model that conditions the probability of x₃ only on the immediately preceding element (x₂) would still be correctly applying the principle of autoregressive conditional probability.

Learn After
  • Base Case for Sequence Probability

  • Joint Probability of a Generated Sequence using the Chain Rule

  • Relationship Between Joint, Conditional, and Marginal Log-Probabilities of Sequences

  • Derivation of Sequence Log-Probability via Chain Rule

  • Logarithmic Form of the Chain Rule for Sequence Probability

  • Formula for an Impossible Initial Event

  • A language model is tasked with calculating the total probability of the three-token sequence 'the cat sat'. The model provides the following probability estimates:

    • The probability of the first token is Pr("the") = 0.1
    • The probability of the second token, given the first, is Pr("cat" | "the") = 0.5
    • The probability of the third token, given the first two, is Pr("sat" | "the", "cat") = 0.8

    Using the principle that the joint probability of a sequence is the product of the conditional probabilities of its components, what is the joint probability Pr("the", "cat", "sat")?

  • Computational Stability of Sequence Probability

  • Which of the following expressions correctly decomposes the joint probability of a four-token sequence (x₁, x₂, x₃, x₄) using the chain rule of probability?