Because the standard VGG-11 architecture is computationally demanding, it is common to construct a modified version with a smaller number of channels when training on simpler datasets like Fashion-MNIST. Rather than starting with $$64$$ channels, the network can be instantiated with a reduced architecture configuration, such as starting with $$16$$ channels and doubling them progressively (e.g., $$16$$, $$32$$, $$64$$, $$128$$, $$128$$). This reduced-capacity network remains more than sufficient for the Fashion-MNIST classification task while significantly accelerating the training process and exhibiting only a small amount of overfitting.

Claude

The original VGG network is commonly referred to as VGG-11 because it contains a total of eleven layers with learnable weights: eight convolutional layers and three fully connected layers. The convolutional feature extractor is constructed using five VGG blocks in sequence. The first two blocks contain one convolutional layer each, while the final three blocks contain two convolutional layers each. The network employs a strategy where the spatial dimensions are halved after each block while the number of feature channels doubles. Starting with $$64$$ output channels in the first block, the channels progressively double ($$128$$, $$256$$, $$512$$) until capping at $$512$$ in the final block, before the resulting feature map is flattened and fed into the fully connected dense layers.

VGG-11 Architecture

Dive into Deep Learning

Training VGG-11 on Fashion-MNIST

We can trace the dimensionality transformations of an input image (e.g., with a spatial shape of $$ 224 \times 224 $$) as it passes through the VGG-11 network. The architecture halves the spatial height and width at each of the five VGG blocks due to the max-pooling operations. The resolution systematically drops from $$ 224 \times 224 $$ to $$ 112 \times 112 $$, $$ 56 \times 56 $$, $$ 28 \times 28 $$, $$ 14 \times 14 $$, and finally reaches $$ 7 \times 7 $$. Meanwhile, the number of channels progressively expands up to $$ 512 $$. The resulting $$ 512 \times 7 \times 7 $$ feature map is then flattened into a $$ 25088 $$-dimensional representation before being fed into the fully connected dense layers.

Learn Before

Related