Formula

Standard Auto-Regressive Probability Factorization using Embeddings

In standard auto-regressive language models, the joint probability of a token sequence is factored using the chain rule of probability. In neural network implementations, conditioning on previous tokens is achieved in practice through their embeddings. For a sequence $\mathbf{x} = (x_0, \dots, x_4)$, the equivalence between the probabilistic formulation and its neural network counterpart can be expressed as:

$$
\begin{aligned}
\Pr(\mathbf{x}) &= \Pr(x_0) \cdot \Pr(x_1 \mid x_0) \cdot \Pr(x_2 \mid x_0, x_1) \cdot \Pr(x_3 \mid x_0, x_1, x_2) \cdot \Pr(x_4 \mid x_0, x_1, x_2, x_3) \\
&= \Pr(x_0) \cdot \Pr(x_1 \mid \mathbf{e}_0) \cdot \Pr(x_2 \mid \mathbf{e}_0, \mathbf{e}_1) \cdot \Pr(x_3 \mid \mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2) \cdot \Pr(x_4 \mid \mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3)
\end{aligned}
$$

where $\mathbf{e}_i$ denotes the embedding of token $x_i$.
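The factorization above can be sketched in code. The following is a minimal toy illustration, not any particular model's implementation: the vocabulary size, embedding dimension, and the mean-of-embeddings context (standing in for a real network such as a Transformer) are all assumptions made for this sketch. It shows that a product of next-token conditionals, each conditioned on the embeddings of the prefix, yields a valid joint distribution over sequences.

```python
import math
import random

random.seed(0)

V, D = 5, 4  # toy vocabulary size and embedding dimension (assumed values)

# Embedding table: row i is e_i, the embedding of vocabulary token i.
E = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
# Output projection: maps a context vector to one logit per vocabulary token.
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_probs(prefix_ids):
    """Pr(x_t | e_0, ..., e_{t-1}): conditions on the *embeddings* of the
    prefix tokens, summarized here by their mean (a toy stand-in for a
    real neural network's context representation)."""
    if not prefix_ids:
        ctx = [0.0] * D  # empty context, so this gives Pr(x_0)
    else:
        embs = [E[i] for i in prefix_ids]
        ctx = [sum(col) / len(embs) for col in zip(*embs)]
    logits = [sum(w * c for w, c in zip(row, ctx)) for row in W]
    return softmax(logits)

def sequence_prob(token_ids):
    """Chain-rule product: Pr(x) = prod_t Pr(x_t | e_0, ..., e_{t-1})."""
    p = 1.0
    for t, tok in enumerate(token_ids):
        p *= next_token_probs(token_ids[:t])[tok]
    return p

x = [0, 3, 1, 4, 2]  # a sequence (x_0, ..., x_4)
print(sequence_prob(x))
```

Because each conditional is a proper distribution, the product sums to 1 over all possible sequences of a given length, which is exactly what the chain-rule factorization guarantees.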

Updated 2026-05-02

Ch.1 Pre-training - Foundations of Large Language Models