A language model is designed to calculate the likelihood of a text sequence by predicting each token based only on the tokens that have come before it. Given the three-token sequence 'The quick brown', which of the following expressions correctly represents how this model would calculate the total probability of the entire sequence?
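The calculation the question is probing is the chain rule of probability: an auto-regressive model scores P(The, quick, brown) = P(The) × P(quick | The) × P(brown | The, quick). A minimal Python sketch of that factorization follows; the conditional probability values are made up for illustration, not the output of any real model.

```python
# Chain-rule factorization used by auto-regressive language models:
#   P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)

def sequence_probability(tokens, cond_prob):
    """Multiply each token's conditional probability given its prefix."""
    prob = 1.0
    for i, token in enumerate(tokens):
        prefix = tuple(tokens[:i])       # all tokens that came before
        prob *= cond_prob[(prefix, token)]
    return prob

# Illustrative (invented) conditional probabilities:
cond_prob = {
    ((), "The"): 0.2,                    # P("The")
    (("The",), "quick"): 0.05,           # P("quick" | "The")
    (("The", "quick"), "brown"): 0.4,    # P("brown" | "The", "quick")
}

p = sequence_probability(["The", "quick", "brown"], cond_prob)
print(p)  # 0.2 * 0.05 * 0.4 = 0.004
```

Each factor conditions only on the preceding tokens, which is exactly the left-to-right prediction order the question describes.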
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Probability Factorization for Arbitrary Order Token Prediction
Step-by-Step Example of Auto-Regressive Sequence Generation
Standard Auto-Regressive Probability Factorization using Embeddings
Example of Auto-Regressive Probability Calculation
Calculating Sequence Probability in an Auto-regressive Model
Debugging a Sequence Probability Calculation