Probability Factorization for Arbitrary Order Token Prediction
Unlike standard language models, which predict tokens strictly from left to right, a model can use a non-sequential prediction order, which changes the joint probability factorization. For example, if the four tokens (x_0, x_1, x_2, x_3) are generated in the order x_2 → x_0 → x_3 → x_1, the generation process is defined as:

Pr(x_0, x_1, x_2, x_3) = Pr(x_2) · Pr(x_0 | e_2) · Pr(x_3 | e_2, e_0) · Pr(x_1 | e_2, e_0, e_3)

In this equation, e_i represents the embedding of token x_i. Because these embeddings incorporate positional information, the original sequence order is preserved even though generation happens out of order. This alternative approach allows token generation to be conditioned on a broader context. Specifically, when predicting token x_1, the model leverages both its left-context (x_0) and its right-context (x_2, x_3). As a result, this approach is somewhat akin to masked language modeling: we conceptually mask out x_1 and use its surrounding tokens to predict it.
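To make the factorization concrete, here is a minimal Python sketch. It is not a real language model: `cond_prob` stands in for a trained network that scores a token given the embeddings of previously generated tokens, and the probability table is a toy invented for illustration.

```python
import math

def joint_log_prob(order, cond_prob):
    """Sum log Pr(x_pos | tokens generated so far) over a permuted order.

    order     : token positions in generation order, e.g. [2, 0, 3, 1]
    cond_prob : callable (position, context_positions) -> probability,
                a stand-in for a trained model's conditional distribution
    """
    log_p = 0.0
    context = []       # positions generated so far; their embeddings e_i
    for pos in order:  # carry positional info, so sequence order survives
        log_p += math.log(cond_prob(pos, tuple(context)))
        context.append(pos)
    return log_p

# Toy conditionals for the generation order x_2 -> x_0 -> x_3 -> x_1:
toy_table = {
    (2, ()):        0.5,  # Pr(x_2)
    (0, (2,)):      0.4,  # Pr(x_0 | e_2)
    (3, (2, 0)):    0.6,  # Pr(x_3 | e_2, e_0)
    (1, (2, 0, 3)): 0.7,  # Pr(x_1 | e_2, e_0, e_3)
}

p = math.exp(joint_log_prob([2, 0, 3, 1],
                            lambda pos, ctx: toy_table[(pos, ctx)]))
print(f"Pr(x_0, x_1, x_2, x_3) = {p:.3f}")  # 0.5 * 0.4 * 0.6 * 0.7 = 0.084
```

Note that the chain rule yields the same joint probability for any valid factorization order; the permutation only changes which conditional distributions the model must learn.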

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability Factorization for Arbitrary Order Token Prediction
Step-by-Step Example of Auto-Regressive Sequence Generation
Standard Auto-Regressive Probability Factorization using Embeddings
A language model is designed to calculate the likelihood of a text sequence by predicting each token based only on the tokens that have come before it. Given the three-token sequence 'The quick brown', which of the following expressions correctly represents how this model would calculate the total probability of the entire sequence? (A worked factorization of this example appears after this list.)
Example of Auto-Regressive Probability Calculation
Calculating Sequence Probability in an Auto-regressive Model
Debugging a Sequence Probability Calculation
Probability Factorization for Arbitrary Order Token Prediction
A language model is pre-trained using an objective where, for the input sentence 'The model learns from text', it might be tasked to predict the word 'learns' based on the context of 'text' and 'The', while the word 'model' is not yet visible to it. In the next step, it might predict 'model' based on 'text', 'The', and the newly predicted 'learns'. What is the primary advantage of this training approach compared to a standard left-to-right sequential prediction?
A language model is being pre-trained on the sentence 'The quick brown fox jumps' using a permuted objective. The model is given a random permutation of the token positions: (3, 5, 1, 4, 2). Arrange the words from the sentence in the order they will be auto-regressively predicted during this training step.
Pre-training Objective Selection
Comparison of Permuted and Causal Language Modeling
Implementing Permutation via Self-Attention Masks
Causal Language Modeling
An auto-regressive neural network is calculating the joint probability of the token sequence (x_0, x_1, x_2, x_3). To do this, it must compute the conditional probability for the final token, expressed as Pr(x_3 | x_0, x_1, x_2). Which statement best analyzes how the neural network practically implements this probabilistic conditioning?
Neural Network Probability Factorization
An auto-regressive neural network is tasked with calculating the total probability of the three-token sequence (x_0, x_1, x_2). Arrange the following computational steps in the correct chronological order that the model would follow, where e_i represents the embedding for token x_i.
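As a reference for the 'The quick brown' question above, the standard left-to-right chain rule, written in the same Pr(·) notation used throughout this note, gives:

Pr(The, quick, brown) = Pr(The) · Pr(quick | The) · Pr(brown | The, quick)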
Learn After
Comparison of Arbitrary Order Prediction and Masked Language Modeling
Visual Representation of Permuted Language Modeling
A language model is tasked with generating a four-token sequence, originally ordered as (x_0, x_1, x_2, x_3). Instead of a standard left-to-right approach, the model generates the tokens in the following arbitrary order: x_2 → x_0 → x_3 → x_1. Given this generation order, which expression correctly represents the conditional probability for predicting the final token, x_1? (Note: e_i represents the embedding of token x_i.)
Contextual Advantages of Non-Sequential Token Generation
A language model is generating a five-token sequence, originally ordered as (x_0, x_1, x_2, x_3, x_4). The model generates the tokens in the following arbitrary order: x_3 → x_1 → x_4 → x_0 → x_2. Arrange the conditional probability terms below to correctly represent the joint probability factorization for this specific generation order. (Note: e_i represents the embedding of token x_i; a worked sketch of this factorization follows below.)
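As a self-check for the arrangement questions above, here is a small Python sketch that spells out the conditional terms for any generation order; the helper `factorization_terms` is illustrative, not part of the course material.

```python
def factorization_terms(order):
    """List the conditional-probability terms for a given generation order.

    Each term conditions on the embeddings e_i of the tokens generated
    before it, matching the e_i notation used in the questions above.
    """
    terms, context = [], []
    for pos in order:
        if context:
            cond = ", ".join(f"e_{i}" for i in context)
            terms.append(f"Pr(x_{pos} | {cond})")
        else:
            terms.append(f"Pr(x_{pos})")  # first token has no context
        context.append(pos)
    return terms

# Five-token question: generation order x_3 -> x_1 -> x_4 -> x_0 -> x_2
print(" * ".join(factorization_terms([3, 1, 4, 0, 2])))
# Pr(x_3) * Pr(x_1 | e_3) * Pr(x_4 | e_3, e_1)
#   * Pr(x_0 | e_3, e_1, e_4) * Pr(x_2 | e_3, e_1, e_4, e_0)
```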