
Comparison of Self-Attention Masking Results

A comparison of the self-attention masking results across causal language modeling, masked language modeling, and permuted language modeling can be visualized using matrices. In these representations, a blue cell at coordinates $(i, j)$ signifies valid attention, indicating that the token at position $j$ attends to the token at position $i$. Conversely, a gray cell denotes blocked attention, meaning the token at position $j$ does not attend to the token at position $i$. Additionally, $\mathbf{e}_{\mathrm{mask}}$ represents the embedding of the symbol $[\mathrm{MASK}]$, which combines both the token embedding and the positional embedding.
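The three masking patterns described above can be sketched as Boolean matrices. This is a minimal illustration, not taken from the source: it assumes the convention that entry `[j, i]` is `True` when the token at position `j` attends to the token at position `i`, and it represents the permuted-LM factorization order as a hypothetical `order` list.

```python
import numpy as np

def causal_mask(n):
    # Causal LM: token j attends only to positions i <= j (lower-triangular).
    return np.tril(np.ones((n, n), dtype=bool))

def mlm_mask(n):
    # Masked LM: fully bidirectional, every token attends to every token.
    return np.ones((n, n), dtype=bool)

def permuted_mask(order):
    # Permuted LM: given a factorization order (a permutation of positions),
    # each token attends to itself and to the tokens that precede it in
    # that permutation, regardless of their original positions.
    n = len(order)
    mask = np.zeros((n, n), dtype=bool)
    for rank, j in enumerate(order):
        for i in order[: rank + 1]:
            mask[j, i] = True
    return mask
```

For example, with `order = [2, 0, 1]`, position 2 attends only to itself, position 0 attends to positions 2 and 0, and position 1 attends to all three, mirroring how permuted language modeling reorders the autoregressive dependency structure.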

[Figure: matrices comparing self-attention masking for causal, masked, and permuted language modeling]

Updated 2026-04-16
