Because the standard VGG-11 architecture is computationally demanding, it is common to build a narrower version when training on simpler datasets like Fashion-MNIST. Rather than starting with $$64$$ channels, the network can be instantiated with a reduced configuration that starts at $$16$$ channels and doubles them from block to block until they cap at $$128$$ (e.g., $$16$$, $$32$$, $$64$$, $$128$$, $$128$$). This reduced-capacity network is still more than sufficient for Fashion-MNIST classification, trains considerably faster, and exhibits only a small amount of overfitting.
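To get a feel for how much cheaper the reduced configuration is, the sketch below counts the convolutional parameters of both configurations in plain Python. It rests on several assumptions that are not stated above: $$3 \times 3$$ kernels with biases, a single-channel (grayscale) input, and the same number of convolutions per block in both configurations. The helper name `conv_param_count` is made up for this illustration, and the fully connected head is left out of the count.

```python
def conv_param_count(arch, in_channels=1, kernel_size=3):
    """Count weights and biases of the convolutional layers described by `arch`.

    `arch` is a sequence of (num_convs, out_channels) pairs, one per VGG block.
    """
    total = 0
    for num_convs, out_channels in arch:
        for _ in range(num_convs):
            weights = in_channels * out_channels * kernel_size * kernel_size
            total += weights + out_channels  # one bias per output channel
            in_channels = out_channels  # the next layer consumes these channels
    return total


# Assumed block layouts: (number of convolutions, output channels) per block.
standard_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
reduced_arch = ((1, 16), (1, 32), (2, 64), (2, 128), (2, 128))

standard = conv_param_count(standard_arch)
reduced = conv_param_count(reduced_arch)
print(f"standard: {standard:,}  reduced: {reduced:,}  "
      f"ratio: {standard / reduced:.1f}x")
```

Dividing every channel count by four shrinks the convolutional parameter count by roughly a factor of sixteen, because each layer's weight count scales with the product of its input and output channels.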

Training VGG-11 on Fashion-MNIST

We can trace the dimensionality transformations of an input image (e.g., with a spatial shape of $$ 224 \times 224 $$) as it passes through the VGG-11 network. The architecture halves the spatial height and width at each of the five VGG blocks due to the max-pooling operations. The resolution systematically drops from $$ 224 \times 224 $$ to $$ 112 \times 112 $$, $$ 56 \times 56 $$, $$ 28 \times 28 $$, $$ 14 \times 14 $$, and finally reaches $$ 7 \times 7 $$. Meanwhile, the number of channels progressively expands up to $$ 512 $$. The resulting $$ 512 \times 7 \times 7 $$ feature map is then flattened into a $$ 25088 $$-dimensional representation before being fed into the fully connected dense layers.
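The shape arithmetic above can be reproduced without any deep learning framework. The following is a minimal sketch, assuming that every convolution is $$3 \times 3$$ with padding $$1$$ (so it preserves height and width) and that every block ends in a $$2 \times 2$$ max-pooling layer with stride $$2$$. The function name `trace_shapes` and the single-channel input are illustrative choices, not part of the original code.

```python
def trace_shapes(arch, in_shape=(1, 224, 224)):
    """Return the (channels, height, width) shape after each VGG block."""
    channels, height, width = in_shape
    shapes = []
    for _num_convs, out_channels in arch:
        # Padded 3x3 convolutions keep height and width unchanged;
        # only the channel count changes.
        channels = out_channels
        # 2x2 max-pooling with stride 2 halves height and width.
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes


vgg11_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
shapes = trace_shapes(vgg11_arch)
for i, shape in enumerate(shapes, start=1):
    print(f"block {i}: {shape}")

# Size of the vector handed to the first fully connected layer.
channels, height, width = shapes[-1]
flattened = channels * height * width
print("flattened features:", flattened)
```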

VGG-11 Layer-by-Layer Shape Trace

The original VGG network is commonly referred to as VGG-11 because it contains a total of eleven layers with learnable weights: eight convolutional layers and three fully connected layers. The convolutional feature extractor is constructed using five VGG blocks in sequence. The first two blocks contain one convolutional layer each, while the final three blocks contain two convolutional layers each. Each block halves the spatial dimensions, and the number of output channels doubles from block to block, starting at $$64$$ in the first block ($$64$$, $$128$$, $$256$$, $$512$$) and staying at $$512$$ in the final block. The resulting feature map is then flattened and fed into the fully connected dense layers.
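The naming convention can be checked with a few lines of Python. This sketch counts the layers with learnable weights for a given block configuration; the helper `weighted_layer_count` is hypothetical, and the VGG-16 configuration is included only for comparison (it is not taken from the text above).

```python
def weighted_layer_count(arch, num_dense=3):
    """Layers with learnable weights: all convolutions plus the dense head."""
    return sum(num_convs for num_convs, _ in arch) + num_dense


# (number of convolutions, output channels) for each of the five blocks.
vgg11_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
vgg16_arch = ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512))  # for comparison

print("VGG-11:", weighted_layer_count(vgg11_arch))
print("VGG-16:", weighted_layer_count(vgg16_arch))
```

Pooling, activation, and dropout layers carry no weights, so they do not contribute to the number in the model name.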

The VGG network architecture is often considered the first truly modern Convolutional Neural Network (CNN) because it introduced a systematic, modular approach to building deep models. While earlier models like AlexNet proved the effectiveness of large-scale CNNs, VGG established key design properties: a preference for deep and narrow networks, and the use of modular blocks containing multiple consecutive convolutional layers. Chaining these VGG blocks together defines an entire family of similarly parameterized models, such as VGG-11 or VGG-16, that lets practitioners trade model complexity against speed.

VGG Network Architecture

Dive into Deep Learning

VGG-11 Architecture

The VGG architecture can be implemented programmatically by defining a class that accepts an architecture configuration (`arch`), which is typically a list of tuples where each tuple specifies the number of convolutional layers and the output channels for a single VGG block. The network is constructed by iterating over this configuration to sequentially add the corresponding blocks, forming the convolutional feature extractor. After the sequence of blocks, the network includes a flatten operation followed by dense fully connected layers (often with dropout) to generate the classification output.
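Before looking at framework-specific code, it may help to see how a configuration expands into a flat layer sequence. The sketch below is framework-free and only produces layer descriptions as strings. It assumes the usual VGG block layout (each convolution followed by a ReLU, and one max-pooling layer closing the block); the function `layer_plan` and the toy configuration are made up for illustration.

```python
def layer_plan(arch, num_classes=10, hidden=4096, dropout=0.5):
    """Expand an `arch` configuration into an ordered list of layer names."""
    plan = []
    # Convolutional feature extractor: one block per (num_convs, out_channels).
    for num_convs, out_channels in arch:
        for _ in range(num_convs):
            plan.append(f"conv3x3({out_channels})")
            plan.append("relu")
        plan.append("maxpool2x2")
    # Classifier head: flatten, two hidden dense layers with dropout, output.
    plan.append("flatten")
    for _ in range(2):
        plan += [f"dense({hidden})", "relu", f"dropout({dropout})"]
    plan.append(f"dense({num_classes})")
    return plan


# A toy two-block configuration, chosen small so the output stays readable.
plan = layer_plan(((1, 16), (2, 32)))
print(" -> ".join(plan))
```

The framework implementations that follow use the same loop over `arch`, with each iteration delegated to a `vgg_block` helper.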

**PyTorch Implementation:**
```python
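from torch import nn
from d2l import torch as d2l

# `vgg_block(num_convs, out_channels)` is assumed to be defined elsewhere.
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        # One VGG block per (num_convs, out_channels) pair in `arch`.
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        # Convolutional blocks, then flatten and the dense classifier head.
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)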
```

**MXNet Implementation:**
```python
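from mxnet import init
from mxnet.gluon import nn
from d2l import mxnet as d2l

# `vgg_block(num_convs, num_channels)` is assumed to be defined elsewhere.
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential()
        # One VGG block per (num_convs, num_channels) pair in `arch`.
        for (num_convs, num_channels) in arch:
            self.net.add(vgg_block(num_convs, num_channels))
        # Gluon's Dense flattens its input by default, so no flatten layer.
        self.net.add(nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
                     nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
                     nn.Dense(num_classes))
        self.net.initialize(init.Xavier())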
```

**JAX Implementation:**
```python
from flax import linen as nn
from d2l import jax as d2l

# `vgg_block(num_convs, out_channels)` is assumed to be defined elsewhere.
class VGG(d2l.Classifier):
    arch: list
    lr: float = 0.1
    num_classes: int = 10
    training: bool = True

    def setup(self):
        # One VGG block per (num_convs, out_channels) pair in `arch`.
        conv_blks = []
        for (num_convs, out_channels) in self.arch:
            conv_blks.append(vgg_block(num_convs, out_channels))

        self.net = nn.Sequential([
            *conv_blks,
            lambda x: x.reshape((x.shape[0], -1)),  # flatten
            nn.Dense(4096), nn.relu,
            # Dropout is only active while training.
            nn.Dropout(0.5, deterministic=not self.training),
            nn.Dense(4096), nn.relu,
            nn.Dropout(0.5, deterministic=not self.training),
            nn.Dense(self.num_classes)])
```

**TensorFlow Implementation:**
```python
import tensorflow as tf
from d2l import tensorflow as d2l

# `vgg_block(num_convs, num_channels)` is assumed to be defined elsewhere.
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential()
        # One VGG block per (num_convs, num_channels) pair in `arch`.
        for (num_convs, num_channels) in arch:
            self.net.add(vgg_block(num_convs, num_channels))
        # Flatten, then the dense classifier head with dropout.
        self.net.add(
            tf.keras.models.Sequential([
                tf.keras.layers.Flatten(),
                tf.keras.layers.Dense(4096, activation='relu'),
                tf.keras.layers.Dropout(0.5),
                tf.keras.layers.Dense(4096, activation='relu'),
                tf.keras.layers.Dropout(0.5),
                tf.keras.layers.Dense(num_classes)]))
```
