Formula

Base Case for Sequence Probability

In the chain rule for sequence probability, the base case is the first token, $x_0$. Since there are no preceding tokens, its probability is its marginal probability, $\text{Pr}(x_0)$. In many language models, this initial token is a deterministic start-of-sequence symbol, so its probability is fixed at 1, i.e., $\text{Pr}(x_0) = 1$. This assumption simplifies the joint probability calculation for the rest of the sequence. Specifically, the probability of the sequence following the initial token, $\text{Pr}(x_1, ..., x_m|x_0)$, is unchanged when multiplied by $\text{Pr}(x_0)$:

$$\text{Pr}(x_0) \text{Pr}(x_1, ..., x_m|x_0) = \text{Pr}(x_1, ..., x_m|x_0)$$

The left-hand side is the joint probability $\text{Pr}(x_0, x_1, ..., x_m)$, so the calculation can effectively start from the second token, conditioned on the first.
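The base case can be illustrated with a minimal Python sketch. It assumes a hypothetical toy bigram table (`BIGRAM`), a hypothetical start symbol `"<s>"`, and made-up probability values; none of these come from the source. The sketch shows only that, when $\text{Pr}(x_0) = 1$, the accumulated log-probability starts at 0 and the loop begins at the second token.

```python
import math

START = "<s>"  # hypothetical start-of-sequence symbol (assumption)

# Hypothetical toy conditional probabilities Pr(x_i | x_{i-1}); values are made up.
BIGRAM = {
    (START, "the"): 0.5,
    ("the", "cat"): 0.2,
}


def toy_cond_prob(token, prefix):
    """Toy stand-in for Pr(token | prefix); it only looks at the last prefix token."""
    return BIGRAM[(prefix[-1], token)]


def sequence_log_prob(tokens, cond_prob):
    """Return log Pr(x_0, ..., x_m) via the chain rule, assuming Pr(x_0) = 1."""
    if not tokens or tokens[0] != START:
        raise ValueError("sequence must begin with the start symbol")
    log_p = 0.0  # base case: log Pr(x_0) = log 1 = 0
    # Start at index 1: each later token is conditioned on everything before it.
    for i in range(1, len(tokens)):
        log_p += math.log(cond_prob(tokens[i], tokens[:i]))
    return log_p


log_p = sequence_log_prob([START, "the", "cat"], toy_cond_prob)
prob = math.exp(log_p)  # product of the two conditional probabilities
```

Working in log space is a common design choice to avoid numerical underflow on long sequences. Under the stated assumption, a sequence containing only the start symbol gets log-probability 0, i.e., probability 1.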

Updated 2026-04-18

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Computing Sciences