Implementing Permutation via Self-Attention Masks

Implementing permuted language models in Transformers is relatively straightforward because the self-attention mechanism is itself insensitive to the order of its inputs. It is therefore unnecessary to explicitly reorder the sequence to obtain a permuted probability factorization. Instead, the desired permutation is implemented by applying appropriate masks to the self-attention layers. For instance, to compute $\Pr(x_1|\mathbf{e}_0, \mathbf{e}_4, \mathbf{e}_2)$, the sequence $x_0, x_1, x_2, x_3, x_4$ can remain in its original order. The self-attention mask is then configured to block the attention between specific tokens, such as blocking the attention from $x_3$ to $x_1$, so that the representation used to predict $x_1$ depends only on $\mathbf{e}_0$, $\mathbf{e}_4$, and $\mathbf{e}_2$, thereby correctly enforcing the targeted permuted prediction order.
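Below is a minimal sketch, assuming NumPy, of how such a mask can be built from a permuted factorization order. The helper name `permutation_mask` and the concrete order $x_0, x_4, x_2, x_1, x_3$ are illustrative choices consistent with the example above, not part of the original text.

```python
import numpy as np

def permutation_mask(order):
    """Build a boolean mask for a permuted factorization order.

    mask[i, j] is True if token j precedes token i in the permuted order,
    i.e. the representation used to predict token i may attend to token j.
    """
    n = len(order)
    rank = np.empty(n, dtype=int)
    rank[list(order)] = np.arange(n)      # rank[p] = step at which token p is predicted
    return rank[None, :] < rank[:, None]  # j is visible to i iff j comes earlier in the order

# Illustrative permuted order x0, x4, x2, x1, x3: x1 is predicted from e0, e4, e2.
mask = permutation_mask([0, 4, 2, 1, 3])
print(mask[1])     # [ True False  True False  True] -> x1 attends to x0, x2, x4 only
print(mask[1, 3])  # False -> attention from x3 to x1 is blocked

# Inside the attention layer, blocked entries are set to -inf before the softmax,
# so they receive zero attention weight.
scores = np.random.randn(5, 5)
masked_scores = np.where(mask, scores, -np.inf)
```

Because the tokens stay in their original positions, nothing else about the model input needs to change; only the mask varies from one sampled permutation to another.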
