Example
Multi-Head Attention Computation Example
To verify the implementation of a multi-head attention mechanism, we can construct a toy example using dummy tensors filled with ones. We instantiate a MultiHeadAttention module with num_hiddens set to 100, num_heads set to 5, and a dropout rate of 0.5. When we pass a batch of queries (shape (2, 4, 100)) along with identical keys and values (shape (2, 6, 100)) and valid lengths of 3 and 2, the output tensor has shape (2, 4, 100): it retains the batch size and the query sequence length, and its last dimension equals num_hiddens.
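import torch
from d2l import torch as d2l

# MultiHeadAttention is the class implemented in this section of D2L.
num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_heads, 0.5)  # dropout = 0.5
batch_size, num_queries, num_kvpairs = 2, 4, 6
valid_lens = torch.tensor([3, 2])  # one valid length per batch element
X = torch.ones((batch_size, num_queries, num_hiddens))  # queries
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))  # keys and values
# The output should keep the query length and the hidden size.
d2l.check_shape(attention(X, Y, Y, valid_lens),
                (batch_size, num_queries, num_hiddens))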
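The shape bookkeeping behind this check can also be traced by hand. The following is a minimal pure-Python sketch, not the D2L implementation: `trace_shapes` and `repeat_valid_lens` are hypothetical helpers introduced here for illustration. It assumes the common design in which the heads are folded into the batch dimension and each valid length is repeated once per head.

```python
def trace_shapes(batch_size, num_queries, num_kvpairs, num_hiddens, num_heads):
    """Return the tensor shape at each stage of multi-head attention.

    Hypothetical helper: assumes heads are folded into the batch dimension.
    """
    if num_hiddens % num_heads != 0:
        raise ValueError("num_hiddens must be divisible by num_heads")
    head_dim = num_hiddens // num_heads  # hidden units handled by each head
    return {
        "queries_in": (batch_size, num_queries, num_hiddens),
        "keys_values_in": (batch_size, num_kvpairs, num_hiddens),
        # After splitting into heads, each head sees a slice of size head_dim.
        "queries_per_head": (batch_size * num_heads, num_queries, head_dim),
        "keys_values_per_head": (batch_size * num_heads, num_kvpairs, head_dim),
        # One weight per (query, key-value pair) for every head.
        "attention_weights": (batch_size * num_heads, num_queries, num_kvpairs),
        "output_per_head": (batch_size * num_heads, num_queries, head_dim),
        # Concatenating the heads restores the hidden size.
        "output": (batch_size, num_queries, num_heads * head_dim),
    }


def repeat_valid_lens(valid_lens, num_heads):
    """Repeat each valid length once per head (hypothetical helper)."""
    return [n for n in valid_lens for _ in range(num_heads)]


shapes = trace_shapes(batch_size=2, num_queries=4, num_kvpairs=6,
                      num_hiddens=100, num_heads=5)
for name, shape in shapes.items():
    print(f"{name:>22}: {shape}")
print(repeat_valid_lens([3, 2], num_heads=5))
```

With the values from the example, each of the 5 heads works on 20 of the 100 hidden units, and concatenating the heads gives back the last dimension of 100 that the shape check expects.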
Updated 2026-05-14
Tags
D2L
Dive into Deep Learning @ D2L