Example

Multi-Head Attention Computation Example

To check the implementation of a multi-head attention mechanism, we can run a toy example on tensors filled with ones. We instantiate a MultiHeadAttention module with num_hiddens set to 100 and num_heads set to 5. We then pass a batch of queries of shape (batch_size, num_queries, num_hiddens), use a single tensor of shape (batch_size, num_kvpairs, num_hiddens) as both keys and values, and supply one valid length per sequence in the batch. The output has shape (batch_size, num_queries, num_hiddens): it keeps the number of queries and the specified hidden dimensionality, regardless of the number of key-value pairs.

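import torch
from d2l import torch as d2l

# MultiHeadAttention is assumed to be the class defined earlier in this section.
num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, num_kvpairs = 2, 4, 6
# One valid length per sequence in the batch.
valid_lens = torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))
# Y serves as both keys and values; the output shape should match X.
d2l.check_shape(attention(X, Y, Y, valid_lens),
                (batch_size, num_queries, num_hiddens))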
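To see why the output shape matches the query shape, it can help to trace the tensor shapes by hand. The sketch below is plain Python with no tensors involved. It assumes a common implementation strategy that this example does not itself show: num_hiddens is split evenly across the heads, and the heads are folded into the batch dimension so that all heads are computed in parallel. The function name multihead_shapes and the dictionary keys are hypothetical, chosen only for illustration.

```python
def multihead_shapes(batch_size, num_queries, num_kvpairs,
                     num_hiddens, num_heads):
    """Trace shapes through an assumed multi-head attention forward pass."""
    if num_hiddens % num_heads != 0:
        raise ValueError("num_hiddens must be divisible by num_heads")
    # Assumption: each head works on an equal slice of the hidden dimension.
    head_dim = num_hiddens // num_heads
    return {
        "queries_in": (batch_size, num_queries, num_hiddens),
        "keys_in": (batch_size, num_kvpairs, num_hiddens),
        # Assumption: heads are folded into the batch dimension.
        "queries_split": (batch_size * num_heads, num_queries, head_dim),
        "keys_split": (batch_size * num_heads, num_kvpairs, head_dim),
        # Each query is scored against every key-value pair.
        "scores": (batch_size * num_heads, num_queries, num_kvpairs),
        # The weighted sum over values removes the num_kvpairs dimension.
        "per_head_out": (batch_size * num_heads, num_queries, head_dim),
        # Concatenating the heads restores the hidden dimension.
        "output": (batch_size, num_queries, head_dim * num_heads),
    }


shapes = multihead_shapes(batch_size=2, num_queries=4, num_kvpairs=6,
                          num_hiddens=100, num_heads=5)
for name, shape in shapes.items():
    print(f"{name:>13}: {shape}")
```

Under these assumptions, num_kvpairs appears only in the intermediate score matrix, which is why the final shape depends on the number of queries and on num_hiddens alone.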

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L