A VGG block is a fundamental structural unit introduced by Simonyan and Zisserman that uses multiple convolutions between downsampling operations. Specifically, a VGG block consists of a sequence of convolutional layers using $$ 3 \times 3 $$ kernels with a padding of 1 (which maintains the spatial height and width), followed by a $$ 2 \times 2 $$ max-pooling layer with a stride of 2 (which halves the spatial dimensions after each block). By executing multiple convolutions within a single block before downsampling via max-pooling, the network can achieve greater depth without exhausting the spatial resolution too quickly.
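The shape arithmetic described above can be traced with a small sketch. It uses the standard output-size formula for convolution and pooling; the function names and the per-block convolution counts are illustrative assumptions, not taken from the original VGG code.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output size of a pooling layer along one spatial dimension."""
    return (size - window) // stride + 1

def vgg_block_out(size, num_convs):
    """Spatial size after num_convs 3x3/pad-1 convolutions and one 2x2/stride-2 max-pool."""
    for _ in range(num_convs):
        size = conv_out(size)   # 3x3 kernel with padding 1 leaves the size unchanged
    return pool_out(size)       # 2x2 pooling with stride 2 halves it

size = 224                          # assumed input height/width
for num_convs in (1, 1, 2, 2, 2):   # illustrative per-block convolution counts
    size = vgg_block_out(size, num_convs)
    print(size)
```

However many convolutions a block contains, the resolution drops only once per block, which is why depth can grow without using up the spatial dimensions.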

VGG Convolutional Block

The standard building block of early Convolutional Neural Networks consists of a convolutional layer with padding to maintain resolution, a nonlinearity (such as a ReLU), and a pooling layer (such as max-pooling) that reduces the spatial resolution. Because max-pooling typically halves the resolution in each block, this approach imposes a strict limit on network depth. Specifically, a network with an input dimension $$d$$ can accommodate at most $$\log_2 d$$ such convolutional layers before the spatial dimensions are entirely exhausted. For example, on ImageNet data, this architectural design restricts the network to a maximum of $$8$$ convolutional layers.
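As a rough check of the $$\log_2 d$$ bound, the sketch below counts how often an input side can be halved. The side length of 224 pixels is an assumption (a common ImageNet crop size that the text above does not state), and `max_halvings` is a hypothetical helper name.

```python
import math

def max_halvings(d):
    """Count how many times d can be halved (floor division) before reaching 1."""
    count = 0
    while d > 1:
        d //= 2
        count += 1
    return count

d = 224  # assumed ImageNet-style input side length
print(math.log2(d))     # about 7.8
print(max_halvings(d))  # 7 halvings bring 224 down to 1
```

Rounding $$\log_2 224 \approx 7.8$$ up gives the limit of 8 layers quoted above.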

Claude

CNNs are made up of three types of layers: convolutional, pooling, and fully connected (FC).

In the convolutional layers, the input matrix is passed through a set of filters that output a feature map. This output is then sent to a pooling layer, which reduces the size of the feature map. This helps reduce the processing time by condensing the map to its most essential information.

The convolutional and pooling processes are repeated several times, with the number of repeats depending on the network, after which the condensed feature map outputs are sent to a series of FC layers. These FC layers then flatten the maps together and compare the probabilities of each feature occurring in conjunction with the others, until the best classification is determined.

Convolutional Neural Networks Architecture

Dive into Deep Learning

*Pros:*
- The architecture allows CNNs to learn the position and scale of features in a variety of images, making them especially good at the classification of hierarchical or spatial data and the extraction of unlabeled features.
- Because of **parameter sharing** and **sparsity of connections**, the network has significantly fewer parameters, which allows it to be trained on smaller training sets and makes it less prone to overfitting.

*Cons:*
- Unfortunately, this structure requires CNNs to accept only fixed-size inputs, and it allows them to produce only fixed-size outputs.

Pros and Cons of CNN Architecture

A fully connected layer is a traditional multilayer perceptron structure. Its input is a one-dimensional vector representing the output of the previous layers. Its output is a list of probabilities for the different possible labels attached to the image (e.g. dog, cat, bird). The label that receives the highest probability is the classification decision.
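The classification step can be sketched as a weighted sum per label followed by a normalization into probabilities. The softmax used here is an assumption (a common choice that the text does not name), and the feature vector, weights, and biases are made-up values for illustration only.

```python
import math

def linear(x, weights, bias):
    """Fully connected layer: one weighted sum per output label."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "cat", "bird"]
features = [0.5, -1.0, 2.0]        # hypothetical flattened feature vector
weights = [[1.0, 0.0, 0.5],        # hypothetical weights, one row per label
           [0.2, 0.3, -0.4],
           [-0.5, 0.1, 0.2]]
bias = [0.0, 0.1, -0.1]            # hypothetical biases

probs = softmax(linear(features, weights, bias))
prediction = labels[probs.index(max(probs))]   # label with the highest probability
print(prediction)
```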

Fully Connected Layer - Classification

Here is a clear 3D representation of a convolutional neural network that recognizes handwritten digits. You can see clearly which layer outputs depend on which inputs:
https://www.cs.ryerson.ca/~aharley/vis/conv/

3D Visualization of a Convolution Neural Network

Three classic networks

A “filter”, sometimes called a “kernel”, is passed over the image, viewing a few pixels at a time (for example, a $$3 \times 3$$ or $$5 \times 5$$ window). At each position, the convolution operation computes a dot product between the pixel values under the filter and the weights defined in the filter: the values are multiplied element-wise, and the products are summed into one number that represents all the pixels the filter observed.
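A minimal sketch of what the filter computes at a single position, using plain nested lists. The patch and kernel values are made up for illustration.

```python
def filter_response(patch, kernel):
    """Multiply a patch element-wise by the kernel and sum the products."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))

patch = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]      # hypothetical 3x3 region of pixel values
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]    # hypothetical filter weights

print(filter_response(patch, kernel))
```

Sliding the same kernel over every position of the image and collecting these numbers yields the feature map.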

Convolution Filter (Kernel)

Pooling operators use a fixed-shape window that slides over all regions of the input tensor based on a specified stride. For each location traversed, this pooling window computes a single output. Unlike convolutional layers that compute cross-correlations using learnable parameters (kernels), the pooling layer contains no parameters. Instead, pooling operations are deterministic, typically calculating the maximum or average value within the pooling window.
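The sliding-window behavior can be sketched in a few lines. This version works on a single 2D list without padding; the function name and the `mode` argument are illustrative assumptions.

```python
def pool2d(x, window=2, stride=2, mode="max"):
    """Slide a window over a 2D list and reduce each window to one value."""
    reduce_fn = max if mode == "max" else (lambda values: sum(values) / len(values))
    out = []
    for i in range(0, len(x) - window + 1, stride):
        row = []
        for j in range(0, len(x[0]) - window + 1, stride):
            values = [x[i + a][j + b] for a in range(window) for b in range(window)]
            row.append(reduce_fn(values))
        out.append(row)
    return out

x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

print(pool2d(x, mode="max"))   # [[6, 8], [14, 16]]
print(pool2d(x, mode="avg"))   # [[3.5, 5.5], [11.5, 13.5]]
```

Nothing in `pool2d` is learned: the output depends only on the input, the window shape, and the stride.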

Pooling Layer in Convolutional Deep Learning

Example of a Convolutional Neural Network Architecture

- An artificial neural network that builds on constructs known from pyramidal cells in the cerebral cortex.
- Residual neural networks do this by using skip connections (shortcuts) to jump over some layers.
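The shortcut idea can be sketched as adding a block's input back onto its output. Here `small_transform` is a hypothetical stand-in for the layers being skipped, and vectors are plain lists.

```python
def residual_block(x, f):
    """Return f(x) + x: the shortcut adds the input back onto the block's output."""
    return [fx + xi for fx, xi in zip(f(x), x)]

def small_transform(x):
    # hypothetical stand-in for the layers that the shortcut jumps over
    return [0.1 * v for v in x]

x = [1.0, -2.0, 3.0]
y = residual_block(x, small_transform)
print(y)
```

If the skipped layers output zeros, the block reduces to the identity, which is one reason such shortcuts make very deep networks easier to train.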


ResNets Convolutional Neural Network

There are three main classic neural network architectures:
- LeNet-5
- AlexNet
- VGG-16

Classic Convolutional Neural Network Architectures for Object Detection in Images

The Inception network, also known as GoogLeNet, is a deep convolutional neural network architecture that arranges multiple Inception modules (multi-branch convolutional blocks) into a sequential pipeline for multi-scale feature detection. GoogLeNet uses a stack of $$9$$ Inception blocks organized into three groups, with max-pooling between the groups to reduce the spatial dimensionality, and employs global average pooling in its output head to generate predictions. The stem of the network resembles earlier architectures like AlexNet and LeNet. The model is computationally complex and involves a large number of relatively arbitrary hyperparameters governing channel counts, the number of blocks before dimensionality reduction, and the relative partitioning of capacity across channels.
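To make the channel-count hyperparameters concrete, the sketch below computes how many channels one multi-branch block produces when its branch outputs are concatenated. It assumes the four-branch layout of the original GoogLeNet design, which the text above does not spell out; the example numbers are those commonly reported for the first Inception block, and the function name is hypothetical.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four branch outputs of an Inception block.

    c1: channels of the 1x1 branch
    c2: (reduction, output) channels of the 1x1 -> 3x3 branch
    c3: (reduction, output) channels of the 1x1 -> 5x5 branch
    c4: channels of the pooling -> 1x1 branch
    """
    # Only the final output of each branch is concatenated along the channel axis.
    return c1 + c2[1] + c3[1] + c4

print(inception_out_channels(64, (96, 128), (16, 32), 32))
```

Each block therefore carries six channel counts to choose, which illustrates why the overall design involves so many hyperparameters.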

Inception Network (GoogLeNet)

The architecture is the overall structure of the network. Most neural networks are organized into groups of units called layers, and different architectures can have different numbers of layers and different connections between them. The main decisions when designing an architecture are the depth of the network and the width of each layer.

Architecture Design

We can compare a convolutional layer to a fully connected layer with an infinitely strong prior over its weights:

1. The weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space.
2. The weights must be zero, except within the small, spatially contiguous receptive field assigned to that hidden unit.

Together, these constraints say that the function the layer learns contains only local interactions and is equivariant to translation.

Pooling is similar: it acts as an infinitely strong prior that each unit should be invariant to small translations.
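The two weight constraints can be made concrete in one dimension. The sketch below builds the fully connected weight matrix that is equivalent to a small 1D cross-correlation; the helper names and the input values are illustrative assumptions.

```python
def corr1d(x, kernel):
    """1D cross-correlation without padding."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]

def as_dense_weights(kernel, input_len):
    """Equivalent fully connected weights: each row is the previous row
    shifted by one position, and every entry outside the kernel is zero."""
    k = len(kernel)
    return [[0] * i + list(kernel) + [0] * (input_len - k - i)
            for i in range(input_len - k + 1)]

x = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]

W = as_dense_weights(kernel, len(x))
dense_out = [sum(w * xi for w, xi in zip(row, x)) for row in W]

for row in W:
    print(row)
print(dense_out == corr1d(x, kernel))   # True
```

Every row of `W` holds the same three weights in a shifted position, and all remaining entries are fixed at zero, which is exactly what the prior demands.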


Convolution and Pooling as an Infinitely Strong Prior

A convolutional layer performs a cross-correlation operation between its input and a kernel, and subsequently adds a scalar bias to the result to generate an output. During model training, the convolutional layer learns two key parameters: the kernel (weights) and the scalar bias. Prior to training, these kernel weights are generally initialized with random values, analogous to the initialization process in fully connected layers.
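A minimal single-channel sketch of such a layer, written with nested lists rather than a deep learning framework. The class name, the uniform random initialization range, and the zero-initialized bias are assumptions made for illustration; training is not shown.

```python
import random

def corr2d(x, kernel):
    """2D cross-correlation of a 2D list with a kernel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(x[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]

class Conv2D:
    """Single-channel convolutional layer: cross-correlation plus a scalar bias."""

    def __init__(self, kernel_size, seed=0):
        rng = random.Random(seed)
        kh, kw = kernel_size
        # Kernel weights start at random values (the range is an assumption).
        self.weight = [[rng.uniform(-1, 1) for _ in range(kw)] for _ in range(kh)]
        self.bias = 0.0   # zero initialization is an assumption

    def forward(self, x):
        return [[v + self.bias for v in row] for row in corr2d(x, self.weight)]

x = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
print(corr2d(x, [[0, 1], [2, 3]]))   # [[19, 25], [37, 43]]

layer = Conv2D(kernel_size=(2, 2))
out = layer.forward(x)               # 2x2 output from a 3x3 input
```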

Convolutional Layer

Although convolutional neural networks typically contain fewer learnable parameters than comparably deep multi-layer perceptrons, they can still be more computationally expensive to train. The reason is parameter sharing: each convolutional kernel weight participates in many more multiplications because it is applied across every spatial position in the input feature map. As a result, the total number of floating-point operations can exceed that of a similarly structured MLP, even though the CNN stores far fewer unique weights. This makes GPU acceleration especially valuable when training CNNs, since the parallelism of GPUs is well-suited to the large number of repeated multiply–accumulate operations that convolutions require.
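The gap between stored weights and arithmetic work can be estimated with simple counting. The layer sizes below are hypothetical and were picked so that both layers hold the same number of weights; the helper names are also illustrative.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulates (MACs) of a convolutional layer."""
    weights = c_out * c_in * k * k
    params = weights + c_out          # one bias per output channel
    macs = weights * h_out * w_out    # every weight is reused at each output position
    return params, macs

def dense_cost(n_in, n_out):
    """Parameters and MACs of a fully connected layer."""
    weights = n_in * n_out
    return weights + n_out, weights   # each weight is used once per example

# Hypothetical sizes: both layers store 36,864 weights.
conv_params, conv_macs = conv_cost(c_in=64, c_out=64, k=3, h_out=56, w_out=56)
dense_params, dense_macs = dense_cost(n_in=192, n_out=192)

print(conv_params, conv_macs)
print(dense_params, dense_macs)
print(conv_macs // dense_macs)        # reuse factor: 56 * 56 = 3136
```

With equal weight counts, the convolution performs as many times more multiply-accumulates as there are output positions.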

CNN Computational Cost versus Parameter Count

In modern deep learning models, particularly Convolutional Neural Networks (CNNs), feature representations are learned automatically rather than being manually engineered. These features are composed hierarchically across multiple jointly learned layers. The lowest layers of the network typically learn to extract basic visual elements, such as edges, colors, and simple textures. Intermediate layers build upon these foundational features to detect more complex structures like object parts, while the highest layers assemble them to recognize entire objects. The final hidden state provides a compact representation that summarizes the image contents, allowing data belonging to different categories to be easily separated by a classifier.

Hierarchical Feature Learning in Vision Networks

Spatial Resolution Limit in CNNs

ParNet is a neural network architecture that achieves competitive performance using a much shallower design compared to deep convolutional models like VGG. Instead of relying on a deep, sequential progression of layers, ParNet utilizes a large number of parallel computations to process information. This approach demonstrates a viable alternative to the dominant trend of increasingly deep networks, suggesting that highly parallelized shallow architectures can also be highly effective for modern deep learning tasks.

ParNet Architecture

The Network in Network (NiN) architecture introduces the concept of applying a fully connected layer at each individual pixel location (height and width) of a feature map. Convolutional layers output four-dimensional tensors corresponding to the example, channel, height, and width, whereas traditional fully connected layers expect two-dimensional tensors corresponding to the example and feature. To bridge this gap without flattening the spatial dimensions prematurely, NiN utilizes $$1 \times 1$$ convolutions. A $$1 \times 1$$ convolution effectively acts as a fully connected layer operating independently on each pixel's channels across the spatial grid.
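The equivalence can be checked on a tiny example. The sketch below omits the example (batch) dimension and the bias for brevity, and all values and helper names are made up for illustration.

```python
def conv1x1(x, weight):
    """x: [c_in][h][w] nested lists, weight: [c_out][c_in]. Returns [c_out][h][w]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(weight))]

def dense(pixel, weight):
    """Fully connected layer applied to one pixel's channel vector."""
    return [sum(wc * pc for wc, pc in zip(row, pixel)) for row in weight]

x = [[[1, 2], [3, 4]],               # channel 0
     [[5, 6], [7, 8]]]               # channel 1
weight = [[1, 0], [0, 1], [1, 1]]    # 3 output channels, 2 input channels

y = conv1x1(x, weight)

# Compare against a fully connected layer at row 0, column 1.
pixel = [x[c][0][1] for c in range(2)]
print(dense(pixel, weight))
print([y[o][0][1] for o in range(3)])
```

Both prints show the same vector: the same weight matrix is applied at every spatial location, with no mixing between neighboring pixels.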

Network in Network (NiN) Architecture

ShiftNet is a convolutional neural network architecture designed to increase functional complexity without adding computational cost. It achieves this by mimicking the effects of a standard $$3 \times 3$$ spatial convolution through a simpler mechanism: it explicitly adds spatially shifted activations across different channels. This "shift" operation requires zero floating-point operations and zero parameters, serving as a highly efficient alternative to traditional convolutions.
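A rough sketch of the shift operation follows. Each channel is moved by its own offset, and positions that become empty are filled with zeros; the zero fill, the per-channel offsets, and the helper names are assumptions rather than details given above.

```python
def shift_channel(channel, dy, dx):
    """Shift a 2D list by (dy, dx), filling vacated positions with zeros."""
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i - dy, j - dx        # source position for output (i, j)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = channel[si][sj]
    return out

def shift_layer(x, offsets):
    """Apply a different spatial shift to each channel; only data is moved."""
    return [shift_channel(ch, dy, dx) for ch, (dy, dx) in zip(x, offsets)]

x = [[[1, 2], [3, 4]],    # channel 0
     [[1, 2], [3, 4]],    # channel 1
     [[1, 2], [3, 4]]]    # channel 2
offsets = [(0, 0), (0, 1), (1, 0)]   # hypothetical: stay, shift right, shift down

for channel in shift_layer(x, offsets):
    print(channel)
```

The function only copies values, so it performs no multiplications or additions on the activations and has no learnable weights.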

ShiftNet Architecture

Traditionally, the design of deep neural networks has been a highly manual process, relying heavily on human intuition to find optimal hyperparameters and structural configurations. Because this manual approach is costly in terms of human time and does not guarantee optimal outcomes, researchers developed the notion of network design spaces. Instead of manually designing a single specific network instance, this automated strategy parameterizes a population of possible architectures—a design space—and searches within it to systematically discover high-quality models, such as the RegNetX and RegNetY families.
