Formula

Probability Factorization for Arbitrary Order Token Prediction

Unlike standard language models that predict tokens strictly from left to right, a model can use a non-sequential prediction order, which changes the joint probability factorization. For example, if tokens are generated in the specific sequence $x_0 \rightarrow x_4 \rightarrow x_2 \rightarrow x_1 \rightarrow x_3$, the generation process is defined as:

$$
\Pr(\mathbf{x}) = \Pr(x_0) \cdot \Pr(x_4 \mid \mathbf{e}_0) \cdot \Pr(x_2 \mid \mathbf{e}_0, \mathbf{e}_4) \cdot \Pr(x_1 \mid \mathbf{e}_0, \mathbf{e}_4, \mathbf{e}_2) \cdot \Pr(x_3 \mid \mathbf{e}_0, \mathbf{e}_4, \mathbf{e}_2, \mathbf{e}_1)
$$

In this equation, $\mathbf{e}_i$ represents the embedding of token $x_i$. Because these embeddings incorporate positional information, the original sequence order is preserved even though tokens are generated out of order. This alternative approach allows token generation to be conditioned on a broader context. Specifically, when predicting token $x_3$, the model leverages both its left-context ($\mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2$) and its right-context ($\mathbf{e}_4$). As a result, this approach is somewhat akin to masked language modeling: we conceptually mask out $x_3$ and use its surrounding tokens to predict it.
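To make the factorization concrete, here is a minimal Python sketch of how a joint log-probability can be accumulated along an arbitrary generation order. The scoring function `conditional_prob`, the vocabulary, and the helper `sequence_log_prob` are hypothetical illustrations (a uniform distribution stands in for a trained model); they are not from the original text.

```python
import math

# Hypothetical toy vocabulary; in practice this would be the model's tokenizer vocabulary.
VOCAB = ["a", "b", "c"]

def conditional_prob(context, position, token):
    """Stand-in for Pr(x_position = token | embeddings of tokens in `context`).

    `context` maps already-generated positions to their tokens; a real model would
    read the position-aware embeddings e_i of those tokens. Here we simply return
    a uniform probability so the sketch is self-contained.
    """
    return 1.0 / len(VOCAB)

def sequence_log_prob(tokens, order):
    """Log Pr(x) factorized along an arbitrary generation `order`.

    `tokens` maps each position to its token; `order` is the permutation of
    positions in which tokens are generated, e.g. [0, 4, 2, 1, 3].
    """
    context = {}        # positions generated so far -> token
    log_prob = 0.0
    for pos in order:
        p = conditional_prob(context, pos, tokens[pos])
        log_prob += math.log(p)
        context[pos] = tokens[pos]   # the token (with its position) joins the conditioning context
    return log_prob

# Example: five tokens generated in the order x_0 -> x_4 -> x_2 -> x_1 -> x_3.
tokens = {0: "a", 1: "b", 2: "c", 3: "a", 4: "b"}
print(sequence_log_prob(tokens, order=[0, 4, 2, 1, 3]))
```

Note that because each position is carried along with its token in `context`, the conditioning matches the formula above: when position 3 is reached, positions 0, 4, 2, and 1 are all available, regardless of their left-to-right order in the sequence.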


