Visual Representation of Permuted Language Modeling
This diagram provides a visual representation of Permuted Language Modeling, where a sequence is generated in a non-sequential, permuted order. In this example, the five tokens x_0, x_1, x_2, x_3, x_4 are generated in the permuted order shown in the figure, written generically here as x_{i_1} → x_{i_2} → x_{i_3} → x_{i_4} → x_{i_5}. Each row illustrates a step in the generation process, where the blue squares indicate the tokens that have already been generated and are used as context for predicting the next token. The step-by-step conditional probabilities are shown on the right (a short code sketch of this bookkeeping follows the list):
- Step 1 (Predict x_{i_1}): The process starts with x_{i_1}, which is treated as a given starting point. Its probability is set to 1: Pr(x_{i_1}) = 1.
- Step 2 (Predict x_{i_2}): The model predicts x_{i_2} conditioned on the embedding of x_{i_1}: Pr(x_{i_2} | e_{i_1}).
- Step 3 (Predict x_{i_3}): The model predicts x_{i_3} conditioned on the embeddings of the already generated tokens x_{i_1} and x_{i_2}: Pr(x_{i_3} | e_{i_1}, e_{i_2}).
- Step 4 (Predict x_{i_4}): The model predicts x_{i_4} using the context of x_{i_1}, x_{i_2}, and x_{i_3}: Pr(x_{i_4} | e_{i_1}, e_{i_2}, e_{i_3}).
- Step 5 (Predict x_{i_5}): Finally, the model predicts x_{i_5} conditioned on all other tokens: Pr(x_{i_5} | e_{i_1}, e_{i_2}, e_{i_3}, e_{i_4}).
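To make the bookkeeping concrete, here is a minimal, self-contained Python sketch of this factorization. It is illustrative only: embed and predict_token_prob are hypothetical stand-ins for a real model's token embeddings and output distribution, and the permuted order is an arbitrary example; only the accumulation loop mirrors the five steps above.

```python
import math
import random

VOCAB_SIZE = 8
EMB_DIM = 4

def embed(token_id):
    """Stand-in for the embedding e_i: a fixed pseudo-random vector per token."""
    rng = random.Random(token_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

def predict_token_prob(target_id, context_embeddings):
    """Toy stand-in for the model's conditional Pr(x_target | context):
    a softmax over an arbitrary deterministic score per vocabulary item."""
    base = sum(sum(e) for e in context_embeddings)
    exps = [math.exp(math.sin(tok + base)) for tok in range(VOCAB_SIZE)]
    return exps[target_id] / sum(exps)

def permuted_lm_log_prob(tokens, order):
    """log Pr(x) factorized along `order`, mirroring the five steps above:
    the first token in the order is given (probability 1), and each later
    token is predicted from the embeddings of the already generated ones."""
    log_p = 0.0    # Step 1: log Pr(x_{i_1}) = log 1 = 0
    context = []   # embeddings of already generated tokens (the blue squares)
    for step, pos in enumerate(order):
        if step > 0:  # Steps 2..n: condition on the current context
            log_p += math.log(predict_token_prob(tokens[pos], context))
        context.append(embed(tokens[pos]))
    return log_p

tokens = [3, 7, 1, 5, 2]   # token ids for x_0 .. x_4
order = [2, 0, 3, 1, 4]    # one possible permuted generation order
print(permuted_lm_log_prob(tokens, order))
```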

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Comparison of Arbitrary Order Prediction and Masked Language Modeling
Visual Representation of Permuted Language Modeling
A language model is tasked with generating a four-token sequence, originally ordered as (x_0, x_1, x_2, x_3). Instead of a standard left-to-right approach, the model generates the tokens in the following arbitrary order: x_2 → x_0 → x_3 → x_1. Given this generation order, which expression correctly represents the conditional probability for predicting the final token, x_1? (Note: e_i represents the embedding of token x_i.)
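Read against the step scheme in the figure above: x_1 is generated last in the order x_2 → x_0 → x_3 → x_1, so all three other tokens are already available as context, and the conditional takes the form Pr(x_1 | e_2, e_0, e_3).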
Contextual Advantages of Non-Sequential Token Generation
A language model is generating a five-token sequence, originally ordered as (x_0, x_1, x_2, x_3, x_4). The model generates the tokens in the following arbitrary order: x_3 → x_1 → x_4 → x_0 → x_2. Arrange the conditional probability terms below to correctly represent the joint probability factorization for this specific generation order. (Note: e_i represents the embedding of token x_i.)
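The factorization for an ordering like this can be read off mechanically. Below is a small Python helper (a hypothetical name, following the same Pr/e_i notation used above) that prints the conditional term for each generation step:

```python
def factorization_terms(order):
    """One conditional-probability term per generation step: step k predicts
    x_{order[k]} given the embeddings of the tokens from steps 0 .. k-1."""
    terms = []
    for step, pos in enumerate(order):
        if step == 0:
            terms.append(f"Pr(x_{pos})")
        else:
            context = ", ".join(f"e_{p}" for p in order[:step])
            terms.append(f"Pr(x_{pos} | {context})")
    return terms

# The order from the question: x_3 -> x_1 -> x_4 -> x_0 -> x_2
print(" * ".join(factorization_terms([3, 1, 4, 0, 2])))
# Pr(x_3) * Pr(x_1 | e_3) * Pr(x_4 | e_3, e_1) * Pr(x_0 | e_3, e_1, e_4)
#   * Pr(x_2 | e_3, e_1, e_4, e_0)
```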
Learn After
A language model is tasked with generating a five-token sequence (x_0, x_1, x_2, x_3, x_4) in a specific permuted order. At each step, the model predicts the next token in the permutation using the embeddings (e.g., e_i for token x_i) of all previously generated tokens as context. Which of the following correctly represents the conditional probability for the third step of this generation process?
A language model generates a four-token sequence (x_0, x_1, x_2, x_3) using a specific permuted order. Arrange the following conditional probability expressions to match this generation sequence, where e_i represents the embedding of token x_i.
A language model is generating a five-token sequence (x_0, x_1, x_2, x_3, x_4) using a permuted, non-sequential order. At a specific step in the generation process, the model calculates the probability of a token conditioned on the embeddings of certain other tokens, where e_i is the embedding of token x_i. Based only on this information, what can be definitively concluded about the generation process?
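A useful invariant for questions like the last one: under the scheme illustrated above, the prediction at step k conditions on exactly k − 1 embeddings. So a conditional of the form Pr(x_j | e_a, e_b), for example, can only arise at the third generation step, with x_a and x_b (and no other tokens) generated before x_j; the original positions of the tokens say nothing about when they are generated.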