Learn Before
Implementing Permutation via Self-Attention Masks
The implementation of permuted language models in Transformers is relatively straightforward because the self-attention mechanism is itself insensitive to the order of its inputs. It is therefore unnecessary to explicitly reorder the sequence to obtain a permuted probability factorization. Instead, the desired permutation is implemented by applying appropriate masks to the self-attention layers. For instance, under a factorization in which x1 is predicted from x3 and x5, the sequence can remain in its original order: the self-attention mask is simply configured to block attention from x1 to every token other than x3 and x5 (here, to x2 and x4), thereby enforcing the targeted permuted prediction order.
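As a rough sketch of how such a mask could be built (assuming PyTorch, and ignoring refinements such as the separate query and content attention streams used in some implementations; the helper name build_permutation_mask is illustrative, not from the course material), each position is allowed to attend only to positions that occur earlier in the sampled factorization order, while the input sequence itself stays in its original order:

```python
import torch

def build_permutation_mask(perm: torch.Tensor) -> torch.Tensor:
    """Boolean self-attention mask for one sampled factorization order.

    perm[k] holds the 0-based position predicted at step k. The returned
    [seq_len, seq_len] mask has mask[i, j] == True when position i may
    attend to position j, i.e. when j is predicted strictly before i.
    """
    seq_len = perm.numel()
    # rank[pos] = the step at which position `pos` is predicted
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[perm] = torch.arange(seq_len)
    # The input is never reordered; only the mask encodes the permutation:
    # i may attend to j exactly when rank[j] < rank[i].
    return rank.unsqueeze(1) > rank.unsqueeze(0)

# The permutation (3, 5, 1, 4, 2) from the related exercise, converted
# to 0-based positions.
perm = torch.tensor([2, 4, 0, 3, 1])
mask = build_permutation_mask(perm)

# Additive form for scaled dot-product attention: blocked pairs get -inf
# so their weights vanish after the softmax.
additive_mask = torch.zeros(mask.shape)
additive_mask[~mask] = float("-inf")
print(mask.int())
```

Under this permutation, the row for position 1 (the token x1) is True only at positions 3 and 5, matching the example above: x1 is predicted third, after x3 and x5, without ever moving any token in the input.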
Tags
Foundations of Large Language Models
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Probability Factorization for Arbitrary Order Token Prediction
A language model is pre-trained using an objective where, for the input sentence 'The model learns from text', it might be tasked with predicting the word 'learns' given only the context words 'text' and 'The', while the word 'model' is not yet visible to it. In the next step, it might predict 'model' based on 'text', 'The', and the newly predicted 'learns'. What is the primary advantage of this training approach compared to a standard left-to-right sequential prediction?
A language model is being pre-trained on the sentence 'The quick brown fox jumps' using a permuted objective. The model is given a random permutation of the token positions: (3, 5, 1, 4, 2). Arrange the words from the sentence in the order they will be auto-regressively predicted during this training step.
Pre-training Objective Selection
Comparison of Permuted and Causal Language Modeling
Implementing Permutation via Self-Attention Masks