Concept

Tensor Manipulation for Parallel Attention Heads

To compute the $h$ heads of a multi-head attention mechanism in parallel, the tensors must be rearranged so that a single call to the underlying attention pooling function handles every head. The input tensors contain the queries, keys, and values for all heads concatenated along the last dimension, and typically have shape $(\text{batch\_size}, \text{num\_queries}, \text{num\_hiddens})$; for keys and values, the second dimension is the number of key-value pairs. They are first reshaped to separate the $h$ heads explicitly, yielding a shape of $(\text{batch\_size}, \text{num\_queries}, h, \text{num\_hiddens} / h)$, which requires $\text{num\_hiddens}$ to be divisible by $h$. A transposition then swaps the sequence length dimension with the head dimension, giving $(\text{batch\_size}, h, \text{num\_queries}, \text{num\_hiddens} / h)$. Finally, flattening the batch and head dimensions together results in a shape of $(\text{batch\_size} \times h, \text{num\_queries}, \text{num\_hiddens} / h)$. In this layout each head looks like an independent batch element, so a standard attention function processes all heads simultaneously. After the attention computation, the same steps are applied in reverse (reshape, transpose, reshape) to concatenate the individual head outputs back into a single tensor of shape $(\text{batch\_size}, \text{num\_queries}, \text{num\_hiddens})$.
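The reshape, transpose, and flatten sequence can be sketched in a few lines. The version below is a minimal illustration, not a reference implementation: it assumes NumPy arrays rather than a deep learning framework's tensors, and the function names `transpose_qkv` and `transpose_output` are chosen here for illustration.

```python
import numpy as np


def transpose_qkv(x, num_heads):
    """Fold the heads into the batch dimension (illustrative helper).

    x: (batch_size, num_queries, num_hiddens)
    returns: (batch_size * num_heads, num_queries, num_hiddens // num_heads)
    """
    batch_size, num_queries, num_hiddens = x.shape
    # Separate the heads: (batch_size, num_queries, num_heads, head_dim)
    x = x.reshape(batch_size, num_queries, num_heads, num_hiddens // num_heads)
    # Swap the sequence and head axes: (batch_size, num_heads, num_queries, head_dim)
    x = x.transpose(0, 2, 1, 3)
    # Flatten batch and head axes: (batch_size * num_heads, num_queries, head_dim)
    return x.reshape(batch_size * num_heads, num_queries, -1)


def transpose_output(x, num_heads):
    """Reverse transpose_qkv, concatenating head outputs (illustrative helper).

    x: (batch_size * num_heads, num_queries, head_dim)
    returns: (batch_size, num_queries, num_heads * head_dim)
    """
    batch_heads, num_queries, head_dim = x.shape
    # Recover the separate batch and head axes
    x = x.reshape(batch_heads // num_heads, num_heads, num_queries, head_dim)
    # Move the head axis back next to the per-head feature axis
    x = x.transpose(0, 2, 1, 3)
    # Merge heads and per-head features into a single hidden dimension
    return x.reshape(x.shape[0], num_queries, num_heads * head_dim)
```

With an input of shape `(2, 3, 8)` and `num_heads=4`, `transpose_qkv` returns an array of shape `(8, 3, 2)`, and applying `transpose_output` to that result recovers the original array. The transposition before the final flatten matters: reshaping directly from `(batch_size, num_queries, num_hiddens)` to the flattened shape would mix values from different sequence positions into the same head.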

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L
