
Masks for Self-attention

In self-attention mechanisms, masks dictate which tokens within a sequence are allowed to interact with one another. This can be conceptualized by distinguishing between valid attention, where information is permitted to flow between tokens, and blocked attention, where the interaction is explicitly suppressed. For example, when processing a sequence of tokens from x_0 to x_4, a specific mask might allow a token like x_1 to receive valid attention from x_0, x_2, and x_4, while assigning blocked attention to x_3. By using these masks, models can selectively control the contextual information available to each token.
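The sketch below illustrates one common way to realize this idea: a boolean mask is applied to the raw attention scores, setting blocked positions to negative infinity so that they receive zero weight after the softmax. The function name masked_attention, the toy embeddings, and the specific mask pattern (reproducing the x_1 vs. x_3 example above, with self-attention on the diagonal assumed to remain valid) are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with a boolean mask.

    mask[i, j] == True  -> valid attention (token i may attend to token j)
    mask[i, j] == False -> blocked attention (score set to -inf before softmax,
                           so its weight becomes exactly zero)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # raw attention scores
    scores = np.where(mask, scores, -np.inf)   # suppress blocked positions
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Five tokens x_0 ... x_4 with toy 4-dimensional embeddings (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

# Mask matching the example in the text: x_1 may interact with x_0, x_2, and x_4,
# while its interaction with x_3 is blocked; all other pairs stay valid.
mask = np.ones((5, 5), dtype=bool)
mask[1, 3] = False

out = masked_attention(X, X, X, mask)
print(out.shape)  # (5, 4): one contextualized vector per token
```

Using an additive mask of this kind keeps the mechanism differentiable and lets the same attention code serve different masking schemes (for example, padding masks or causal masks) simply by changing the boolean pattern.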
