The ResNeXt architecture addresses the trade-off between nonlinearity and dimensionality in standard ResNet designs. Instead of increasing network depth or widening convolutions, ResNeXt increases the number of channels that carry information between blocks while avoiding a quadratic computational penalty. Inspired by the Inception block's strategy of separating information flow into independent groups, ResNeXt applies the exact same transformation across all of its parallel branches. This uniform multi-branch design minimizes the need for manual hyperparameter tuning for each branch.
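
To make the cost argument concrete, the sketch below counts the weights of a convolution whose channels are split into independent groups. The helper name `conv_weight_count` and the sizes (256 channels, 32 groups) are illustrative assumptions, not values taken from the text above. The count follows from each group connecting only its own share of input channels to its own share of output channels.

```python
def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights (bias excluded) in a grouped 2D convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    per_group = (c_in // groups) * (c_out // groups) * kernel * kernel
    return groups * per_group


dense = conv_weight_count(256, 256, 3)               # ordinary 3x3 convolution
grouped = conv_weight_count(256, 256, 3, groups=32)  # 32 parallel branches
print(dense, grouped, dense // grouped)
```

With these assumed sizes, the grouped layer needs 32 times fewer weights than the dense one, which is why the channel count can grow without the quadratic penalty of a dense convolution.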

ResNeXt Architecture

The output channels of each Inception block in GoogLeNet are partitioned across the four parallel branches, and both this partition and the intermediate dimensionality-reduction ratios vary slightly from block to block. In Module $$ b_3 $$, the first Inception block outputs 256 channels (64 + 128 + 32 + 32) in a 2:4:1:1 ratio. Its 192 input channels are reduced to $$ \frac{1}{2} $$ of that number in the second branch (96 intermediate channels) and to $$ \frac{1}{12} $$ in the third branch (16 intermediate channels). The second Inception block increases the output to 480 channels (128 + 192 + 96 + 64) in a 4:6:3:2 ratio; its 256 input channels are reduced to $$ \frac{1}{2} $$ and $$ \frac{1}{8} $$, yielding 128 and 32 intermediate channels. Across Modules $$ b_3 $$, $$ b_4 $$, and $$ b_5 $$, the second branch (with the $$ 3 \times 3 $$ convolution) generally produces the largest share of output channels; in the second block of $$ b_3 $$ it is followed by the first branch ($$ 1 \times 1 $$), the third branch ($$ 5 \times 5 $$), and the fourth branch ($$ 3 \times 3 $$ max-pooling).
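
The arithmetic for the two $$ b_3 $$ blocks can be checked with a short script. This is only a sketch: the channel numbers are the ones quoted above, while the tuple layout and the helper name `summarize` are assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# (input channels, branch 1, (branch 2 reduce, out), (branch 3 reduce, out), branch 4)
B3_BLOCKS = [
    (192, 64, (96, 128), (16, 32), 32),
    (256, 128, (128, 192), (32, 96), 64),
]


def summarize(c_in, c1, c2, c3, c4):
    """Total output channels, branch ratio, and reduction ratios of one block."""
    outs = [c1, c2[1], c3[1], c4]
    common = reduce(gcd, outs)
    return {
        "total": sum(outs),
        "ratio": ":".join(str(c // common) for c in outs),
        "reduce_2": Fraction(c2[0], c_in),  # share of input kept in branch 2
        "reduce_3": Fraction(c3[0], c_in),  # share of input kept in branch 3
    }


summaries = [summarize(*block) for block in B3_BLOCKS]
for s in summaries:
    print(s)
```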

GoogLeNet Channel Ratios in Inception Blocks

The fundamental convolutional block in the GoogLeNet architecture is the Inception block. It consists of four parallel branches that process the input to extract information at different spatial scales. The first branch uses a $$1 \times 1$$ convolutional layer. The second and third branches start with a $$1 \times 1$$ convolution to reduce the number of channels and model complexity, followed by $$3 \times 3$$ and $$5 \times 5$$ convolutions, respectively. The fourth branch applies a $$3 \times 3$$ max-pooling layer followed by a $$1 \times 1$$ convolutional layer to adjust channel counts. All branches use appropriate padding to ensure the spatial dimensions (height and width) of the input and output remain identical. Finally, the outputs from these four branches are concatenated along the channel dimension to form the block's output.
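
The shape bookkeeping of the four branches can be sketched without any deep learning library. The function names and the per-layer tuple layout below are hypothetical, and the sketch assumes stride 1 with padding $$(k - 1) / 2$$ for a window of size $$k$$, which is one way to realize the "appropriate padding" mentioned above.

```python
def conv_out(n, kernel, padding=0, stride=1):
    """Spatial size after a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def inception_shape(h, w, c1, c2, c3, c4):
    """Output (channels, height, width); c2 and c3 are (reduce, out) pairs."""
    # Each branch is a list of (kernel, padding, out_channels) layers.
    branches = [
        [(1, 0, c1)],
        [(1, 0, c2[0]), (3, 1, c2[1])],
        [(1, 0, c3[0]), (5, 2, c3[1])],
        [(3, 1, None), (1, 0, c4)],  # max-pooling leaves the channel count unchanged
    ]
    shapes = []
    for layers in branches:
        bh, bw, ch = h, w, None
        for kernel, padding, out_ch in layers:
            bh, bw = conv_out(bh, kernel, padding), conv_out(bw, kernel, padding)
            ch = out_ch if out_ch is not None else ch
        shapes.append((ch, bh, bw))
    # Concatenation along channels requires identical height and width.
    if any((bh, bw) != (h, w) for _, bh, bw in shapes):
        raise ValueError("branches disagree on spatial size")
    return sum(ch for ch, _, _ in shapes), h, w


print(inception_shape(28, 28, 64, (96, 128), (16, 32), 32))
```

The 28 x 28 input size is an arbitrary example; the channel arguments reuse the first block of Module $$b_3$$.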

Claude

University of Michigan - Ann Arbor

The Inception network, also known as GoogLeNet, is a deep convolutional neural network that arranges multiple Inception modules (multi-branch convolutional blocks) into a sequential pipeline for multi-scale feature detection. Its stem resembles earlier architectures such as AlexNet and LeNet. The body stacks $$9$$ Inception blocks organized into three groups, with max-pooling between the groups to reduce the spatial resolution, and the output head uses global average pooling to generate predictions. The model is computationally complex and involves a large number of relatively arbitrary hyperparameters governing channel counts, the number of blocks before each dimensionality reduction, and the relative partitioning of capacity across channels.

Inception Network (GoogLeNet)

Dive into Deep Learning

If the input has multiple channels, a 1 x 1 convolution filter combines the values at the same spatial position across all input channels into a single output value.

If we convolve an *n x n x m* input with *f* filters of size 1 x 1, we get an *n x n x f* output, where each cell is a linear combination (weighted sum) of the corresponding cells across the *m* input channels.

Choosing *f* smaller than *m* reduces the number of channels, which saves computation and memory.
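
A minimal pure-Python sketch of this operation follows. It assumes the input is stored as nested lists in *n x n x m* order and leaves out the bias term and activation; the function name `conv1x1` and the toy values are illustrative only.

```python
def conv1x1(x, weights):
    """Apply f filters of size 1 x 1 to an n x n x m input.

    x       -- nested lists indexed as x[row][col][channel]
    weights -- f lists, each holding m channel weights
    Returns an n x n x f nested list.
    """
    return [
        [
            # One weighted sum over the m channels per filter.
            [sum(w_c * v for w_c, v in zip(w, cell)) for w in weights]
            for cell in row
        ]
        for row in x
    ]


x = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [10, 11, 12]]]      # n = 2, m = 3
weights = [[1, 1, 1], [1, 0, -1]]    # f = 2 filters
y = conv1x1(x, weights)
print(y)  # [[[6, -2], [15, -2]], [[24, -2], [33, -2]]]
```

Here three input channels are reduced to two output channels, while the spatial size stays 2 x 2.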

1 x 1 Convolution Layer in Neural Networks (Network in Network)

The bottleneck layer reduces training time by cutting the number of features and operations. Giving a layer fewer units (channels) than the layer before it reduces the dimensionality of the representation that later, more expensive layers must process.

As shown in the figure, the bottom architecture requires 120 M computations, but by adding the bottleneck layer in the middle, in the architecture shown on top, the number of computations is reduced to 12.4 M.
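
Since the figure is not reproduced here, the sketch below uses one set of layer sizes that yields the quoted totals. These sizes are an assumption, not taken from the text above: a 28 x 28 x 192 input, a 5 x 5 convolution with 32 output channels, and a 1 x 1 bottleneck with 16 channels. Only multiplications are counted, and the helper name is hypothetical.

```python
def conv_multiplications(h, w, c_in, c_out, kernel):
    """Multiplications for a stride-1 convolution that preserves height and width."""
    return h * w * c_out * kernel * kernel * c_in


# Direct 5x5 convolution: 192 -> 32 channels.
direct = conv_multiplications(28, 28, 192, 32, 5)

# With a bottleneck: 1x1 convolution 192 -> 16, then 5x5 convolution 16 -> 32.
reduce_step = conv_multiplications(28, 28, 192, 16, 1)
expand_step = conv_multiplications(28, 28, 16, 32, 5)
with_bottleneck = reduce_step + expand_step

print(f"{direct / 1e6:.1f} M vs {with_bottleneck / 1e6:.1f} M")  # 120.4 M vs 12.4 M
```

Under these assumed sizes the bottleneck cuts the cost by roughly a factor of ten, because the expensive 5 x 5 window now runs over 16 channels instead of 192.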

Bottleneck Layer in Inception Network

Auxiliary classifiers are side branches of an Inception network that take the output of some hidden layers and make a prediction, using a few fully connected layers and a softmax activation to predict the output label.

They help ensure that the features computed at intermediate layers are already good enough to predict the output class of an image. In other words, they have a regularizing effect on the Inception network and help prevent it from overfitting.

Auxiliary Classifiers in Inception Network

Going deeper with convolutions paper

1. Inception

2. Multiple losses: the Inception network is 22 layers deep. In addition to the output of the last layer, it uses auxiliary classifiers: the output of an intermediate layer is fed to a classifier, and its result is added to the final classification result with a small weight (0.3), which is roughly equivalent to model fusion. This also injects additional gradient signal during back-propagation and provides extra regularization.
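
One common way to realize this weighting during training is to add the auxiliary losses to the main loss, each scaled by 0.3. The sketch below assumes that reading. The function names and the toy probabilities are hypothetical, and the classifiers are represented only by their softmax outputs.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under a softmax output."""
    return -math.log(probs[label])


def combined_loss(main_probs, aux_probs_list, label, aux_weight=0.3):
    """Main loss plus the down-weighted losses of the auxiliary classifiers."""
    main = cross_entropy(main_probs, label)
    aux = sum(cross_entropy(p, label) for p in aux_probs_list)
    return main + aux_weight * aux


# Toy example: the final classifier and two side branches scoring three classes.
loss = combined_loss(
    main_probs=[0.7, 0.2, 0.1],
    aux_probs_list=[[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]],
    label=0,
)
print(loss)
```

At inference time only the main output would typically be used, so the side branches affect training alone.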

Features of GoogLeNet

Inception Block Structure

The GoogLeNet model is constructed from five sequential modules (labeled $$b_1$$ through $$b_5$$) followed by a fully connected output layer. The overall architecture diagram is shown in Fig. 8.4.2.

- Module $$b_1$$ (Stem): A $$7 \times 7$$ convolutional layer with $$64$$ output channels, stride $$2$$, and padding $$3$$, followed by ReLU activation and a $$3 \times 3$$ max-pooling layer (stride $$2$$, padding $$1$$). This module resembles the stems of AlexNet and LeNet.
- Module $$b_2$$: A $$1 \times 1$$ convolution with $$64$$ channels, then a $$3 \times 3$$ convolution that triples the channels to $$192$$, each followed by ReLU, concluding with $$3 \times 3$$ max-pooling (stride $$2$$, padding $$1$$).
- Module $$b_3$$: Two Inception blocks producing $$64+128+32+32=256$$ and $$128+192+96+64=480$$ output channels respectively, followed by $$3 \times 3$$ max-pooling.
- Module $$b_4$$: Five Inception blocks producing $$512$$, $$512$$, $$512$$, $$528$$, and $$832$$ output channels respectively, followed by $$3 \times 3$$ max-pooling.
- Module $$b_5$$: Two Inception blocks producing $$832$$ and $$1024$$ output channels respectively, followed by global average pooling (reducing each channel to $$1 \times 1$$) and a flatten operation.

Finally, a fully connected layer maps the $$1024$$-dimensional representation to the number of output classes.
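
The modules above can be traced for a square input to see how the resolution shrinks as the channel count grows. This is a sketch under stated assumptions: the $$3 \times 3$$ convolution in $$b_2$$ uses padding $$1$$, and the max-pooling layers after $$b_3$$ and $$b_4$$ use stride $$2$$ and padding $$1$$ like the earlier ones. The list above does not state these values. The function names are hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_googlenet(size=224):
    """Return (module, output channels, spatial size) after each module."""
    trace = []
    # b1: 7x7 convolution (stride 2, padding 3), then 3x3 max-pooling (stride 2, padding 1).
    size = out_size(size, 7, stride=2, padding=3)
    size = out_size(size, 3, stride=2, padding=1)
    trace.append(("b1", 64, size))
    # b2 to b4: the convolutions and Inception blocks keep the size (assumed
    # padding); only the closing max-pooling of each module halves it.
    for name, channels in [("b2", 192), ("b3", 480), ("b4", 832)]:
        size = out_size(size, 3, stride=2, padding=1)
        trace.append((name, channels, size))
    # b5: global average pooling reduces every channel to 1 x 1.
    trace.append(("b5", 1024, 1))
    return trace


for module, channels, size in trace_googlenet(224):
    print(f"{module}: {channels} channels, {size} x {size}")
```

For a 224 x 224 input, the sizes after the five modules are 56, 28, 14, 7, and 1, so the fully connected layer sees a 1024-dimensional vector.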

GoogLeNet Model Architecture

A defining characteristic of GoogLeNet is that it is computationally cheaper to evaluate than its predecessors while simultaneously providing improved accuracy. This architecture initiated a shift toward deliberate network design, where researchers explicitly trade off the computational cost of inference against the reduction of prediction errors.

GoogLeNet Computational Efficiency Trade-off

The Inception module is built on a few design principles. It uses 1×1 convolutions to cheaply raise or lower the number of channels (the channel dimension). It runs convolutions of several window sizes—1×1, 3×3, and 5×5—in parallel on the same input, so features at different spatial scales are extracted at once, yielding a richer representation. It also approximates a sparse connectivity structure with dense, grouped components, which speeds up convergence. Separating the branches by filter size groups highly correlated features together within each branch; since training ultimately aims to extract independent features, clustering correlated features in advance helps the network converge faster.
