1Cademy - Multi-Head Attention Computation Example

Learn Before

Multi-Head Attention Implementation

Example

To sanity-check an implementation of multi-head attention, we can run it on a toy example with dummy tensors of ones. We instantiate a MultiHeadAttention module with num_hiddens set to 100 and num_heads set to 5, then pass it a batch of queries of shape $(\text{batch\_size}, \text{num\_queries}, \text{num\_hiddens})$, a single tensor of shape $(\text{batch\_size}, \text{num\_kvpairs}, \text{num\_hiddens})$ that serves as both keys and values, and a vector of valid lengths that tells each sequence how many key-value pairs to attend to. The output must have shape $(\text{batch\_size}, \text{num\_queries}, \text{num\_hiddens})$, here $(2, 4, 100)$: the number of queries is preserved, and the outputs of the 5 heads are concatenated back to the full hidden size. Note that this check only confirms the output shape, not the numerical correctness of the attention weights.

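import torch
from d2l import torch as d2l

# MultiHeadAttention is the class defined in the prerequisite
# "Multi-Head Attention Implementation" node.
num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_heads, 0.5)  # 0.5: dropout rate, as in d2l
batch_size, num_queries, num_kvpairs = 2, 4, 6
valid_lens = torch.tensor([3, 2])  # attend to the first 3 and 2 key-value pairs
X = torch.ones((batch_size, num_queries, num_hiddens))  # queries
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))  # keys and values
# The output must have shape (batch_size, num_queries, num_hiddens).
d2l.check_shape(attention(X, Y, Y, valid_lens), (batch_size, num_queries, num_hiddens))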
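
To see why the output ends up with that shape, it can help to trace the intermediate shapes by hand. The sketch below uses no tensor library and only does the shape arithmetic. It assumes an implementation that splits num_hiddens evenly across heads and folds the heads into the batch axis so they run in parallel, as the Dive into Deep Learning version does. The helper names trace_multihead_shapes and repeat_valid_lens are hypothetical and exist only for this illustration.

```python
def trace_multihead_shapes(batch_size, num_queries, num_kvpairs,
                           num_hiddens, num_heads):
    """Return the tensor shape at each stage of multi-head attention.

    Assumes heads are folded into the batch axis (an assumption about
    the implementation, not something this example checks).
    """
    if num_hiddens % num_heads != 0:
        raise ValueError("num_hiddens must be divisible by num_heads")
    head_dim = num_hiddens // num_heads
    return {
        # Inputs after the linear projections (shape unchanged).
        "queries": (batch_size, num_queries, num_hiddens),
        "keys_values": (batch_size, num_kvpairs, num_hiddens),
        # Hidden axis split into heads, heads merged into the batch axis.
        "queries_per_head": (batch_size * num_heads, num_queries, head_dim),
        "keys_values_per_head": (batch_size * num_heads, num_kvpairs, head_dim),
        # One score per (query, key) pair in every head.
        "scores": (batch_size * num_heads, num_queries, num_kvpairs),
        # Weighted sum of values, still separated by head.
        "per_head_output": (batch_size * num_heads, num_queries, head_dim),
        # Heads concatenated back along the hidden axis.
        "output": (batch_size, num_queries, head_dim * num_heads),
    }


def repeat_valid_lens(valid_lens, num_heads):
    """Repeat each sequence's valid length once per head."""
    return [n for n in valid_lens for _ in range(num_heads)]


shapes = trace_multihead_shapes(batch_size=2, num_queries=4, num_kvpairs=6,
                                num_hiddens=100, num_heads=5)
for stage, shape in shapes.items():
    print(f"{stage:>22}: {shape}")
print(repeat_valid_lens([3, 2], num_heads=5))
```

With the values from the example, each head works with 100 / 5 = 20 hidden units, the folded batch has 2 * 5 = 10 entries, and the final shape comes back to (2, 4, 100). The valid lengths are repeated per head so that every head of a given sequence masks the same key-value positions.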

Updated 2026-06-19

References

Dive into Deep Learning