Comparison of Arbitrary Order Prediction and Masked Language Modeling

Predicting tokens in an arbitrary, permuted order lets generation condition on a broader context, and shares conceptual similarities with Masked Language Modeling (MLM). Rather than being limited to the preceding tokens, as in standard left-to-right models, it enables the use of bidirectional context. For example, when generating token $x_3$, the model might consider both its left context (embeddings $\mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2$) and its right context (embedding $\mathbf{e}_4$). Because these embeddings incorporate the positional information of their respective tokens ($x_0, x_1, x_2, x_4$), the original sequence order is preserved. Consequently, this approach functions much like MLM: it is as if $x_3$ were masked out and the model used the surrounding tokens to predict it.
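To make the connection concrete, here is a minimal Python sketch, not taken from the text, showing how a factorization order determines which embeddings each prediction may attend to. The function name `permutation_mask` and the 5-token example are illustrative assumptions; the point is only that placing position 3 last in the order gives it exactly the bidirectional context an MLM would see with $x_3$ masked.

```python
# Minimal sketch (illustrative, not the book's implementation):
# build a visibility mask from a permuted factorization order.
import numpy as np

def permutation_mask(order):
    """mask[i, j] is True when the prediction for position i may
    attend to embedding e_j, i.e. when j precedes i in `order`."""
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            mask[i, j] = rank[j] < rank[i]
    return mask

# Predict x_3 last: factorization order 0, 1, 2, 4, 3.
mask = permutation_mask([0, 1, 2, 4, 3])
print(mask[3])
# True for positions 0, 1, 2, 4 and False for 3: predicting x_3
# conditions on e_0, e_1, e_2, e_4, exactly as if x_3 were masked.
```

Because each $\mathbf{e}_j$ already encodes position $j$, permuting the prediction order only changes which context is visible at each step, not the meaning of the sequence itself.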
