Formula

Standard Auto-Regressive Probability Factorization using Embeddings

In standard auto-regressive language models, the joint probability of a token sequence is factored using the chain rule of probability. In neural network implementations, conditioning on previous tokens is achieved in practice through their embeddings. For a sequence $\mathbf{x} = (x_0, \dots, x_4)$, the equivalence between the probabilistic formulation and its neural network counterpart can be expressed as:

$$
\begin{aligned}
\Pr(\mathbf{x}) &= \Pr(x_0) \cdot \Pr(x_1 \mid x_0) \cdot \Pr(x_2 \mid x_0, x_1) \cdot \Pr(x_3 \mid x_0, x_1, x_2) \cdot \Pr(x_4 \mid x_0, x_1, x_2, x_3) \\
&= \Pr(x_0) \cdot \Pr(x_1 \mid \mathbf{e}_0) \cdot \Pr(x_2 \mid \mathbf{e}_0, \mathbf{e}_1) \cdot \Pr(x_3 \mid \mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2) \cdot \Pr(x_4 \mid \mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3)
\end{aligned}
$$

where $\mathbf{e}_i$ denotes the embedding of token $x_i$.
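The factorization above can be sketched in code. The following is a minimal toy illustration, not any particular model's implementation: the vocabulary size, embedding dimension, and the mean-of-embeddings context (standing in for a real network such as a Transformer) are all assumptions made for this sketch. It shows that a product of next-token conditionals, each conditioned on the embeddings of the prefix, yields a valid joint distribution over sequences.

```python
import math
import random

random.seed(0)

V, D = 5, 4  # toy vocabulary size and embedding dimension (assumed values)

# Embedding table: row i is e_i, the embedding of vocabulary token i.
E = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
# Output projection: maps a context vector to one logit per vocabulary token.
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_probs(prefix_ids):
    """Pr(x_t | e_0, ..., e_{t-1}): conditions on the *embeddings* of the
    prefix tokens, summarized here by their mean (a toy stand-in for a
    real neural network's context representation)."""
    if not prefix_ids:
        ctx = [0.0] * D  # empty context, so this gives Pr(x_0)
    else:
        embs = [E[i] for i in prefix_ids]
        ctx = [sum(col) / len(embs) for col in zip(*embs)]
    logits = [sum(w * c for w, c in zip(row, ctx)) for row in W]
    return softmax(logits)

def sequence_prob(token_ids):
    """Chain-rule product: Pr(x) = prod_t Pr(x_t | e_0, ..., e_{t-1})."""
    p = 1.0
    for t, tok in enumerate(token_ids):
        p *= next_token_probs(token_ids[:t])[tok]
    return p

x = [0, 3, 1, 4, 2]  # a sequence (x_0, ..., x_4)
print(sequence_prob(x))
```

Because each conditional is a proper distribution, the product sums to 1 over all possible sequences of a given length, which is exactly what the chain-rule factorization guarantees.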

Updated 2026-05-02

Ch.1 Pre-training - Foundations of Large Language Models