In self-attention mechanisms, masks dictate which tokens within a sequence are allowed to interact with one another. This can be conceptualized by distinguishing between valid attention, where information is permitted to flow between tokens, and blocked attention, where the interaction is explicitly suppressed. For example, when processing a sequence of tokens from $$x_0$$ to $$x_4$$, a specific mask might allow a token like $$x_1$$ to receive valid attention from $$x_0$$, $$x_2$$, and $$x_4$$, while assigning blocked attention to $$x_3$$. By using these masks, models can selectively control the contextual information available to each token.
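The effect of such a mask on a single token can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the raw scores are made-up numbers, the helper name `masked_softmax` is hypothetical, and letting $$x_1$$ attend to itself is an assumption, since the example above does not say either way.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row of attention scores; blocked entries get weight 0.

    scores: raw query-key scores for one query token (list of floats)
    mask:   list of bools, True = valid attention, False = blocked attention
    Assumes at least one position is valid.
    """
    # Blocked positions are set to -inf so that exp(-inf) == 0 after softmax.
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores of query x_1 against keys x_0..x_4.
scores_x1 = [0.5, 1.0, 0.2, 2.0, -0.3]
# Mask from the example: x_0, x_2, x_4 are valid, x_3 is blocked.
# Self-attention of x_1 to itself is allowed here as an assumption.
mask_x1 = [True, True, True, False, True]

weights_x1 = masked_softmax(scores_x1, mask_x1)
```

Although $$x_3$$ has the largest raw score in this toy example, its weight is exactly zero after masking, and the remaining weights still sum to one.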

Masks for Self-attention

When evaluating architectures such as CNNs, RNNs, and self-attention for mapping an input sequence of $$n$$ tokens to an output sequence of the same length (with each token represented as a $$d$$-dimensional vector), three main properties are compared: computational complexity, sequential operations, and maximum path lengths. A smaller number of sequential operations is desirable as it allows for parallel computation, while a shorter maximum path length between tokens makes it easier for the network to learn long-range dependencies.
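For reference, the per-layer values usually quoted for these three properties are summarized below. The symbol $$k$$ (the convolution kernel size) is not defined in the text above and is introduced here as an assumption. The CNN path length shown is for stacked contiguous kernels; with dilated convolutions it is commonly given as $$\mathcal{O}(\log_k n)$$ instead.

$$
\begin{array}{lccc}
\text{Layer type} & \text{Computational complexity} & \text{Sequential operations} & \text{Maximum path length} \\
\hline
\text{CNN} & \mathcal{O}(k n d^2) & \mathcal{O}(1) & \mathcal{O}(n/k) \\
\text{RNN} & \mathcal{O}(n d^2) & \mathcal{O}(n) & \mathcal{O}(n) \\
\text{Self-attention} & \mathcal{O}(n^2 d) & \mathcal{O}(1) & \mathcal{O}(1)
\end{array}
$$

Self-attention therefore has the shortest path between any two tokens and is fully parallel, at the cost of computation that grows quadratically with the sequence length $$n$$.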

Comparing CNN, RNN, and Self-Attention Architectures

In the encoder, the input first passes through a self-attention module, which produces a weighted feature representation x.

- Multi-head self-attention: several attention heads are computed in parallel, each from its own projections, and their outputs are concatenated into a single $D_m$-dimensional representation

- Masked attention: self-attention modules in the decoder are adapted to prevent each position from attending to subsequent positions

- Cross-attention: in the decoder, the queries are projected from the outputs of the previous (decoder) layer, whereas the keys and values are projected using the outputs of the encoder
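The difference between masked self-attention and cross-attention comes down to where the queries, keys, and values come from and whether a mask is applied. The sketch below illustrates this under simplifying assumptions: the learned query, key, and value projections are omitted, only a single head is shown, and the function names and toy inputs are hypothetical. A multi-head version would run several such calls on differently projected inputs and concatenate the results.

```python
import math

def attention(queries, keys, values, mask=None):
    """Scaled dot-product attention over lists of vectors (lists of floats).

    mask[i][j] is True when query i may attend to key j; None means no masking.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product score of query i against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if mask is not None:
            scores = [s if mask[i][j] else -math.inf for j, s in enumerate(scores)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output i is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

def causal_mask(n):
    """mask[i][j] is True only for j <= i, so no position sees a later one."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Toy inputs (hypothetical values): 3 decoder positions, 2 encoder positions, d = 2.
decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_outputs = [[0.5, 0.5], [2.0, -1.0]]

# Masked self-attention: queries, keys, and values all come from the decoder.
masked_out = attention(decoder_states, decoder_states, decoder_states,
                       mask=causal_mask(len(decoder_states)))

# Cross-attention: queries from the decoder, keys and values from the encoder.
cross_out = attention(decoder_states, encoder_outputs, encoder_outputs)
```

With the causal mask, the first decoder position can only attend to itself, so its output equals its own value vector. The cross-attention output has one vector per decoder position, each a mixture of the encoder outputs.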

Attention in vanilla Transformers

"Attention Is All You Need" (Vaswani et al., 2017), the highly influential paper that introduced the Transformer model.
https://arxiv.org/abs/1706.03762
