Formula

Chain Rule for Sequence Probability

The chain rule of probability is used in language modeling to compute the joint probability of a sequence of tokens, denoted $x_0, x_1, \dots, x_m$. The rule factors the overall sequence probability into a product of conditional probabilities, one per token, each conditioned on all the tokens that precede it:

$$\Pr(x_0, \dots, x_m) = \Pr(x_0) \cdot \Pr(x_1|x_0) \cdot \Pr(x_2|x_0, x_1) \cdots \Pr(x_m|x_0, \dots, x_{m-1})$$

In compact product notation:

$$\Pr(x_0, \dots, x_m) = \prod_{i=0}^{m} \Pr(x_i|x_0, \dots, x_{i-1})$$

Here the $i = 0$ factor has an empty context and is simply $\Pr(x_0)$. For computational efficiency and numerical stability, the product is frequently rewritten in logarithmic form, which turns it into a sum and avoids underflow when many small probabilities are multiplied: $\log \Pr(x_0, \dots, x_m) = \sum_{i=0}^{m} \log \Pr(x_i|x_0, \dots, x_{i-1})$.
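The factorization can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `TOY_MODEL` is a hypothetical hand-written table of next-token distributions (not a real language model), and `cond_prob`, `sequence_prob`, and `sequence_log_prob` are made-up helper names. The sketch computes the same quantity twice, once as a product of conditionals and once as a sum of log-probabilities.

```python
import math

# Hypothetical toy model (illustration only): maps a prefix, stored as a
# tuple of tokens, to a probability distribution over the next token.
TOY_MODEL = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.25, "ran": 0.75},
}


def cond_prob(token, prefix):
    """Return Pr(token | prefix) from the toy table."""
    return TOY_MODEL[tuple(prefix)][token]


def sequence_prob(tokens):
    """Chain rule as a product: multiply Pr(x_i | x_0, ..., x_{i-1}) over i."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= cond_prob(tok, tokens[:i])  # tokens[:0] is the empty context
    return prob


def sequence_log_prob(tokens):
    """Chain rule in log form: sum log Pr(x_i | x_0, ..., x_{i-1}) over i."""
    return sum(
        math.log(cond_prob(tok, tokens[:i])) for i, tok in enumerate(tokens)
    )


seq = ["the", "cat", "sat"]
p = sequence_prob(seq)        # product of the factors 0.5, 0.4 and 0.25
lp = sequence_log_prob(seq)   # sum of the logs of the same three factors
print(p, lp, math.exp(lp))
```

Exponentiating the log-probability recovers the product up to floating-point rounding. For long sequences the product can underflow to zero in floating point, while the sum of logs stays representable, which is why the log form is usually the one computed in practice.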

Updated 2026-04-18

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
