
Chain Rule of Probability for Auto-regressive Language Models

Auto-regressive language models calculate the probability of a text sequence $\mathbf{x}$ by decomposing it into a product of conditional probabilities using the chain rule. The probability of each token $x_i$ is conditioned on all preceding tokens in the sequence. The general formula for a sequence $\mathbf{x} = (x_0, \ldots, x_{m-1})$ is:

$$\text{Pr}(\mathbf{x}) = \prod_{i=0}^{m-1} \text{Pr}(x_i \mid x_0, \ldots, x_{i-1})$$

For example, for a sequence of five tokens, this expands to:

$$\text{Pr}(\mathbf{x}) = \text{Pr}(x_0) \cdot \text{Pr}(x_1 \mid x_0) \cdot \text{Pr}(x_2 \mid x_0, x_1) \cdot \text{Pr}(x_3 \mid x_0, x_1, x_2) \cdot \text{Pr}(x_4 \mid x_0, x_1, x_2, x_3)$$
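The decomposition above can be sketched in code. The snippet below is a minimal illustration, not a real language model: `cond_prob` is a hypothetical stand-in that returns a uniform distribution over a toy vocabulary, where a trained model would instead score the next token given the prefix. The chain-rule product itself is implemented exactly as in the formula.

```python
import math

# Toy 4-token vocabulary (an assumption for illustration only).
VOCAB = ["a", "b", "c", "d"]

def cond_prob(token, prefix):
    """Stand-in for Pr(x_i | x_0, ..., x_{i-1}).
    A trained model would condition on the prefix; this stub is uniform."""
    return 1.0 / len(VOCAB)

def sequence_prob(tokens):
    """Chain rule: Pr(x) = prod_i Pr(x_i | x_0, ..., x_{i-1})."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= cond_prob(tok, tokens[:i])
    return prob

def sequence_log_prob(tokens):
    """Same quantity in log space; summing log-probabilities
    avoids numerical underflow for long sequences."""
    return sum(math.log(cond_prob(tok, tokens[:i]))
               for i, tok in enumerate(tokens))

p = sequence_prob(["a", "b", "a"])  # (1/4)^3 = 0.015625
```

In practice, implementations work with `sequence_log_prob` rather than the raw product, since multiplying many probabilities below 1 quickly underflows floating-point precision.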

Updated 2025-10-08

Tags: Ch.1 Pre-training - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences