Formula

Probability Factorization for Arbitrary Order Token Prediction

Unlike standard language models that predict tokens strictly from left to right, a model can use a non-sequential prediction order, which changes the joint probability factorization. For example, if tokens are generated in the specific sequence $x_0 \rightarrow x_4 \rightarrow x_2 \rightarrow x_1 \rightarrow x_3$, the generation process is defined as:

$$
\Pr(\mathbf{x}) = \Pr(x_0) \cdot \Pr(x_4 \mid \mathbf{e}_0) \cdot \Pr(x_2 \mid \mathbf{e}_0, \mathbf{e}_4) \cdot \Pr(x_1 \mid \mathbf{e}_0, \mathbf{e}_4, \mathbf{e}_2) \cdot \Pr(x_3 \mid \mathbf{e}_0, \mathbf{e}_4, \mathbf{e}_2, \mathbf{e}_1)
$$

In this equation, $\mathbf{e}_i$ represents the embedding of token $x_i$. Because these embeddings incorporate positional information, the original sequence order is preserved even though tokens are generated out of order. This alternative approach allows token generation to be conditioned on a broader context. Specifically, when predicting token $x_3$, the model leverages both its left-context ($\mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2$) and its right-context ($\mathbf{e}_4$). As a result, this approach is somewhat akin to masked language modeling: we conceptually mask out $x_3$ and use its surrounding tokens to predict it.
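To make the factorization concrete, here is a minimal Python sketch of how a joint log-probability can be accumulated along an arbitrary generation order. The scoring function `conditional_prob`, the vocabulary, and the helper `sequence_log_prob` are hypothetical illustrations (a uniform distribution stands in for a trained model); they are not from the original text.

```python
import math

# Hypothetical toy vocabulary; in practice this would be the model's tokenizer vocabulary.
VOCAB = ["a", "b", "c"]

def conditional_prob(context, position, token):
    """Stand-in for Pr(x_position = token | embeddings of tokens in `context`).

    `context` maps already-generated positions to their tokens; a real model would
    read the position-aware embeddings e_i of those tokens. Here we simply return
    a uniform probability so the sketch is self-contained.
    """
    return 1.0 / len(VOCAB)

def sequence_log_prob(tokens, order):
    """Log Pr(x) factorized along an arbitrary generation `order`.

    `tokens` maps each position to its token; `order` is the permutation of
    positions in which tokens are generated, e.g. [0, 4, 2, 1, 3].
    """
    context = {}        # positions generated so far -> token
    log_prob = 0.0
    for pos in order:
        p = conditional_prob(context, pos, tokens[pos])
        log_prob += math.log(p)
        context[pos] = tokens[pos]   # the token (with its position) joins the conditioning context
    return log_prob

# Example: five tokens generated in the order x_0 -> x_4 -> x_2 -> x_1 -> x_3.
tokens = {0: "a", 1: "b", 2: "c", 3: "a", 4: "b"}
print(sequence_log_prob(tokens, order=[0, 4, 2, 1, 3]))
```

Note that because each position is carried along with its token in `context`, the conditioning matches the formula above: when position 3 is reached, positions 0, 4, 2, and 1 are all available, regardless of their left-to-right order in the sequence.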


