The network head is the final component of a vision Transformer. After the stack of Transformer encoder blocks has processed the full token sequence, only the output representation of the `<cls>` token is kept. The head, typically layer normalization followed by a linear layer, projects this representation to the final outputs, such as class predictions.
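The head described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the reference implementation: the sizes (`num_hiddens`, `num_classes`), the random stand-in for the encoder output, and the assumption that the `<cls>` token sits at position 0 of the sequence are all chosen here for demonstration.

```python
import torch
from torch import nn

# Hypothetical sizes, chosen only for this sketch
num_hiddens, num_classes = 24, 10

# Head: layer normalization followed by a linear layer
head = nn.Sequential(nn.LayerNorm(num_hiddens),
                     nn.Linear(num_hiddens, num_classes))

# Stand-in for the encoder output: batch of 2, one <cls> token plus 100 patches
encoder_output = torch.randn(2, 101, num_hiddens)

# Assumption: the <cls> token is the first token of every sequence
cls_repr = encoder_output[:, 0]   # shape (2, 24)
logits = head(cls_repr)           # shape (2, 10)
print(logits.shape)               # torch.Size([2, 10])
```

Only one token per sequence reaches the head, so its cost does not depend on the number of patches.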

The vision Transformer encoder block processes sequences of image patches and is characterized by its pre-normalization architecture. Within this block, layer normalization is applied right before both the multi-head attention mechanism and the multilayer perceptron (MLP). This pre-normalization strategy generally leads to more effective and efficient training compared to the post-normalization design found in the original Transformer. Furthermore, similar to standard Transformer blocks, a vision Transformer encoder block preserves the exact shape of its input throughout its operations.
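To make the difference between the two placements concrete, the sketch below wires the same sublayers both ways. The class names `PreNormBlock` and `PostNormBlock` are hypothetical, PyTorch's built-in `nn.MultiheadAttention` stands in for whatever attention module is actually used, and the MLP is simplified to two linear layers with a GELU in between. Only the position of the layer normalization differs between the two `forward` methods.

```python
import torch
from torch import nn

class _Sublayers(nn.Module):
    """Shared sublayers; only the wiring in forward() differs (hypothetical)."""
    def __init__(self, num_hiddens, mlp_num_hiddens, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(num_hiddens)
        self.ln2 = nn.LayerNorm(num_hiddens)
        # Stand-in attention module; batch_first gives (batch, seq, feature)
        self.attention = nn.MultiheadAttention(num_hiddens, num_heads,
                                               batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(num_hiddens, mlp_num_hiddens),
                                 nn.GELU(),
                                 nn.Linear(mlp_num_hiddens, num_hiddens))

class PreNormBlock(_Sublayers):
    def forward(self, X):
        # Normalize right before each sublayer, then add the residual
        Y = self.ln1(X)
        X = X + self.attention(Y, Y, Y, need_weights=False)[0]
        return X + self.mlp(self.ln2(X))

class PostNormBlock(_Sublayers):
    def forward(self, X):
        # "Add & norm": add the residual first, then normalize
        X = self.ln1(X + self.attention(X, X, X, need_weights=False)[0])
        return self.ln2(X + self.mlp(X))

X = torch.ones(2, 100, 24)
pre_out = PreNormBlock(24, 48, 8)(X)
post_out = PostNormBlock(24, 48, 8)(X)
print(pre_out.shape, post_out.shape)
```

Both variants return a tensor with the shape of `X`. In the pre-normalization variant the residual path carries the unnormalized input straight through the block, whereas in the post-normalization variant every residual sum passes through a layer normalization.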

Vision Transformer Encoder Block

Dive into Deep Learning

The vision Transformer encoder block combines layer normalization, multi-head attention, and the vision Transformer MLP (`ViTMLP`). Following the pre-normalization design, the input is first normalized and passed to self-attention as queries, keys, and values, and the attention output is added back to the unnormalized input through a residual connection. The result is normalized again, processed by the MLP, and added back through a second residual connection. Because both sublayers produce outputs with `num_hiddens` features and each output is added to its own input, the block's output has the same shape as its input.

```python
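# PyTorch
import torch
from torch import nn
from d2l import torch as d2l

# ViTMLP (the vision Transformer MLP) is assumed to be defined already.
class ViTBlock(nn.Module):
    def __init__(self, num_hiddens, norm_shape, mlp_num_hiddens,
                 num_heads, dropout, use_bias=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(norm_shape)
        self.attention = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                dropout, use_bias)
        self.ln2 = nn.LayerNorm(norm_shape)
        self.mlp = ViTMLP(mlp_num_hiddens, num_hiddens, dropout)

    def forward(self, X, valid_lens=None):
        # Pre-normalization: normalize, self-attend (queries = keys = values),
        # then add the result to the unnormalized input
        X = X + self.attention(*([self.ln1(X)] * 3), valid_lens)
        # Normalize again, apply the MLP, and add the second residual
        return X + self.mlp(self.ln2(X))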
```

```python
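# JAX
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l

# ViTMLP (the vision Transformer MLP) is assumed to be defined already.
class ViTBlock(nn.Module):
    num_hiddens: int
    mlp_num_hiddens: int
    num_heads: int
    dropout: float
    use_bias: bool = False

    def setup(self):
        self.attention = d2l.MultiHeadAttention(self.num_hiddens, self.num_heads,
                                                self.dropout, self.use_bias)
        self.mlp = ViTMLP(self.mlp_num_hiddens, self.num_hiddens, self.dropout)

    @nn.compact
    def __call__(self, X, valid_lens=None, training=False):
        # Pre-normalization: normalize, self-attend (queries = keys = values),
        # then add the result to the unnormalized input
        X = X + self.attention(*([nn.LayerNorm()(X)] * 3),
                               valid_lens, training=training)[0]
        # Normalize again, apply the MLP, and add the second residual
        return X + self.mlp(nn.LayerNorm()(X), training=training)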
```

Implementation of the Vision Transformer Encoder Block

The vision Transformer encoder block does not change the shape of its input. Whatever the layer normalization, multi-head attention, and multilayer perceptron (MLP) do internally, the output has the same batch size, sequence length, and hidden size as the input. For instance, an input tensor of shape $$(2, 100, 24)$$ (batch size 2, 100 tokens, 24 hidden units) yields an output tensor of shape $$(2, 100, 24)$$.

```python
# PyTorch
X = torch.ones((2, 100, 24))
encoder_blk = ViTBlock(24, 24, 48, 8, 0.5)
encoder_blk.eval()
d2l.check_shape(encoder_blk(X), X.shape)
```

```python
# JAX
X = jnp.ones((2, 100, 24))
encoder_blk = ViTBlock(24, 48, 8, 0.5)
d2l.check_shape(encoder_blk.init_with_output(d2l.get_key(), X)[0], X.shape)
```
