
Output Shape of the Vision Transformer Encoder Block

A fundamental property of the vision Transformer encoder block is that it does not change the shape of its input tensor. Whatever transformations the pre-normalization layers, multi-head attention, and multilayer perceptron (MLP) apply internally, the output sequence has exactly the same shape as the input sequence. For instance, an input tensor of shape (2, 100, 24), that is, batch size 2, sequence length 100, and hidden size 24, processed by the encoder block yields an output tensor of shape (2, 100, 24).

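# PyTorch (ViTBlock is assumed to be defined already)
import torch
from d2l import torch as d2l

X = torch.ones((2, 100, 24))
encoder_blk = ViTBlock(24, 24, 48, 8, 0.5)
encoder_blk.eval()  # disable dropout
d2l.check_shape(encoder_blk(X), X.shape)

# JAX (ViTBlock is assumed to be defined already)
import jax.numpy as jnp
from d2l import jax as d2l

X = jnp.ones((2, 100, 24))
encoder_blk = ViTBlock(24, 48, 8, 0.5)
d2l.check_shape(encoder_blk.init_with_output(d2l.get_key(), X)[0], X.shape)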
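
The shape check confirms the result, but it may help to see why the shape survives each step. The following is a minimal sketch in plain Python that does shape arithmetic only and computes no tensors. The helper name `vit_block_shape_trace` and the stage labels are hypothetical, introduced here for illustration. It assumes the usual pre-norm layout with a residual connection around both the attention and the MLP, and the common convention that each head receives `num_hiddens / num_heads` features.

```python
def vit_block_shape_trace(shape, mlp_num_hiddens, num_heads):
    """Trace shapes through a pre-norm Transformer encoder block.

    Hypothetical helper for illustration: shape arithmetic only,
    no tensor computation is performed.
    """
    batch, seq_len, num_hiddens = shape
    if num_hiddens % num_heads != 0:
        raise ValueError("num_hiddens must be divisible by num_heads")
    head_dim = num_hiddens // num_heads

    trace = [("input", (batch, seq_len, num_hiddens))]
    # Layer normalization rescales features; the shape is unchanged.
    trace.append(("ln1", (batch, seq_len, num_hiddens)))
    # Attention splits the hidden dimension across the heads ...
    trace.append(("split_heads", (batch, num_heads, seq_len, head_dim)))
    # ... and concatenates the heads back into one hidden dimension.
    trace.append(("attention", (batch, seq_len, num_heads * head_dim)))
    # The residual addition requires input and attention output to match.
    trace.append(("residual1", (batch, seq_len, num_hiddens)))
    trace.append(("ln2", (batch, seq_len, num_hiddens)))
    # The MLP widens the hidden dimension, then projects it back.
    trace.append(("mlp_hidden", (batch, seq_len, mlp_num_hiddens)))
    trace.append(("mlp_out", (batch, seq_len, num_hiddens)))
    trace.append(("residual2", (batch, seq_len, num_hiddens)))
    return trace


# Same sizes as the example: hidden size 24, MLP width 48, 8 heads.
for name, stage_shape in vit_block_shape_trace((2, 100, 24), 48, 8):
    print(f"{name:12s} {stage_shape}")
```

Only two intermediate stages deviate from (2, 100, 24): the per-head view inside attention and the widened MLP hidden layer. Both are mapped back before the corresponding residual addition, which is what lets encoder blocks be stacked without any reshaping in between.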


Updated 2026-05-15


Tags

D2L

Dive into Deep Learning @ D2L