Learn Before
Implementing Permutation via Self-Attention Masks
The implementation of permuted language models in Transformers is relatively straightforward because the self-attention mechanism is itself insensitive to the order of its inputs. It is therefore unnecessary to explicitly reorder the sequence to obtain a permuted probability factorization. Instead, the desired permutation is implemented by applying appropriate masks to the self-attention layers. For instance, under a factorization in which x1 is predicted from x3 and x5, the sequence can remain in its original order: the self-attention mask is simply configured to block attention from x1 to every token other than x3 and x5 (here, to x2 and x4), thereby enforcing the targeted permuted prediction order.
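As a rough sketch of how such a mask could be built (assuming PyTorch, and ignoring refinements such as the separate query and content attention streams used in some implementations; the helper name build_permutation_mask is illustrative, not from the course material), each position is allowed to attend only to positions that occur earlier in the sampled factorization order, while the input sequence itself stays in its original order:

```python
import torch

def build_permutation_mask(perm: torch.Tensor) -> torch.Tensor:
    """Boolean self-attention mask for one sampled factorization order.

    perm[k] holds the 0-based position predicted at step k. The returned
    [seq_len, seq_len] mask has mask[i, j] == True when position i may
    attend to position j, i.e. when j is predicted strictly before i.
    """
    seq_len = perm.numel()
    # rank[pos] = the step at which position `pos` is predicted
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[perm] = torch.arange(seq_len)
    # The input is never reordered; only the mask encodes the permutation:
    # i may attend to j exactly when rank[j] < rank[i].
    return rank.unsqueeze(1) > rank.unsqueeze(0)

# The permutation (3, 5, 1, 4, 2) from the related exercise, converted
# to 0-based positions.
perm = torch.tensor([2, 4, 0, 3, 1])
mask = build_permutation_mask(perm)

# Additive form for scaled dot-product attention: blocked pairs get -inf
# so their weights vanish after the softmax.
additive_mask = torch.zeros(mask.shape)
additive_mask[~mask] = float("-inf")
print(mask.int())
```

Under this permutation, the row for position 1 (the token x1) is True only at positions 3 and 5, matching the example above: x1 is predicted third, after x3 and x5, without ever moving any token in the input.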
Tags
Foundations of Large Language Models
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability Factorization for Arbitrary Order Token Prediction
A language model is pre-trained using an objective where, for the input sentence 'The model learns from text', it might be tasked with predicting the word 'learns' given only the context words 'text' and 'The', while the word 'model' is not yet visible to it. In the next step, it might predict 'model' based on 'text', 'The', and the newly predicted 'learns'. What is the primary advantage of this training approach compared to a standard left-to-right sequential prediction?
A language model is being pre-trained on the sentence 'The quick brown fox jumps' using a permuted objective. The model is given a random permutation of the token positions: (3, 5, 1, 4, 2). Arrange the words from the sentence in the order they will be auto-regressively predicted during this training step.
Pre-training Objective Selection
Comparison of Permuted and Causal Language Modeling
Implementing Permutation via Self-Attention Masks