To implement self-attention with a multi-head attention module, pass the same input tensor as the queries, keys, and values. For a batch of sequences represented by a tensor `X` with shape `(batch_size, num_steps, num_hiddens)`, self-attention outputs a tensor of the same shape. The following PyTorch snippet demonstrates this with the `MultiHeadAttention` class from the `d2l` package; in the snippet, the sequence length `num_steps` is named `num_queries`:

```python
import torch
from d2l import torch as d2l

num_hiddens, num_heads = 100, 5
# The third argument is the dropout rate
attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
# valid_lens gives the number of non-padding tokens in each sequence
batch_size, num_queries, valid_lens = 2, 4, torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))

# Self-attention passes X as queries, keys, and values
d2l.check_shape(attention(X, X, X, valid_lens),
                (batch_size, num_queries, num_hiddens))
```


Given a sequence of input tokens $$\mathbf{x}_1, \ldots, \mathbf{x}_n$$ where each token $$\mathbf{x}_i \in \mathbb{R}^d$$ for $$1 \leq i \leq n$$, the self-attention mechanism produces an output sequence of the same length, denoted as $$\mathbf{y}_1, \ldots, \mathbf{y}_n$$. Each output vector $$\mathbf{y}_i$$ is computed by treating the token $$\mathbf{x}_i$$ as the query, and the entire sequence of tokens as both the keys and the values. This is mathematically defined as:

$$ \mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d $$

where $$f$$ represents the general attention pooling function.
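
As a minimal sketch of this definition, the snippet below assumes that $$f$$ is scaled dot-product attention with no learned projections, which is only one possible choice of attention pooling. The function name `self_attention` and the sizes used are illustrative and do not come from the text above.

```python
import math
import torch

def self_attention(X):
    """Scaled dot-product self-attention without learned projections.

    X has shape (n, d), one row per token x_i.
    Returns Y with shape (n, d), where row i is y_i.
    """
    d = X.shape[-1]
    # scores[i, j] compares query x_i with key x_j
    scores = X @ X.T / math.sqrt(d)
    # Normalize over the keys so each row of weights sums to 1
    weights = torch.softmax(scores, dim=-1)
    # y_i is the weighted average of the values x_1, ..., x_n
    return weights @ X

torch.manual_seed(0)
X = torch.randn(4, 8)  # n = 4 tokens, d = 8 (illustrative sizes)
Y = self_attention(X)  # same shape as X: one output per input token
```

Because every $$\mathbf{y}_i$$ is a convex combination of the input tokens, the output sequence has the same length and dimension as the input.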


Dive into Deep Learning


In self-attention, the queries, keys, and values all come from the same input sequence. Rather than drawing them from separate sequences, the model derives the query, key, and value vectors from that single sequence, so every token can attend to every token in it, itself included.
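
The sketch below illustrates this with a single attention head, assuming that the queries, keys, and values are obtained through three learned linear projections of the same input `X`. The layer names `W_q`, `W_k`, and `W_v` and the tensor sizes are hypothetical choices made for illustration.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_steps, num_hiddens = 2, 4, 8  # illustrative sizes
X = torch.randn(batch_size, num_steps, num_hiddens)

# Three separate projections, all applied to the same input sequence
W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)
W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
W_v = nn.Linear(num_hiddens, num_hiddens, bias=False)
Q, K, V = W_q(X), W_k(X), W_v(X)

# weights[b, i, j] is how strongly token i attends to token j in sequence b
scores = Q @ K.transpose(-2, -1) / math.sqrt(num_hiddens)
weights = torch.softmax(scores, dim=-1)
Y = weights @ V  # shape (batch_size, num_steps, num_hiddens)
```

Each row of `weights` is a softmax over all positions of the same sequence, so every token assigns a nonzero weight to every token, including itself.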
