Code
Self-Attention Tensor Computation Example
To implement self-attention with a multi-head attention module, pass the same input tensor as the queries, keys, and values. For a batch of sequences represented by a tensor X of shape (batch_size, num_steps, num_hiddens), the self-attention output has the same shape as X. The following PyTorch snippet demonstrates this with the d2l MultiHeadAttention class; in the code, num_queries plays the role of num_steps, and valid_lens gives the number of unmasked positions in each sequence:
import torch
from d2l import torch as d2l

num_hiddens, num_heads = 100, 5
# The third argument (0.5) is the dropout rate
attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, valid_lens = 2, 4, torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
# Self-attention passes X as queries, keys, and values
d2l.check_shape(attention(X, X, X, valid_lens),
                (batch_size, num_queries, num_hiddens))
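The same shape behavior can be sketched without the d2l package, using PyTorch's built-in torch.nn.MultiheadAttention. This is a different implementation from the d2l class, so treat it as an illustration only. In particular, converting valid_lens into a boolean key_padding_mask is an assumption about how the two masking conventions correspond, and the name num_steps is chosen here to match the prose.

```python
import torch
from torch import nn

num_hiddens, num_heads = 100, 5
batch_size, num_steps = 2, 4

# batch_first=True makes inputs (batch_size, num_steps, num_hiddens)
attention = nn.MultiheadAttention(num_hiddens, num_heads, dropout=0.5,
                                  batch_first=True)
attention.eval()  # disable dropout so the attention weights are deterministic

X = torch.ones((batch_size, num_steps, num_hiddens))
valid_lens = torch.tensor([3, 2])

# Assumed equivalent of valid_lens: True marks key positions to ignore
key_padding_mask = torch.arange(num_steps)[None, :] >= valid_lens[:, None]

# Self-attention: X serves as queries, keys, and values
output, weights = attention(X, X, X, key_padding_mask=key_padding_mask)

# The output keeps the input shape
assert output.shape == X.shape
```

By default the returned weights are averaged over heads and have shape (batch_size, num_steps, num_steps); the columns for masked key positions are zero.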
Updated 2026-05-14
Tags
D2L
Dive into Deep Learning @ D2L