The output channels of each Inception block in GoogLeNet are partitioned across the four parallel branches, and both this partition and the intermediate dimensionality-reduction ratios vary from block to block. In Module $$ b_3 $$, the first Inception block outputs 256 channels (64 + 128 + 32 + 32) in a 2:4:1:1 ratio. Its 192-channel input is reduced to $$ \frac{1}{2} $$ of its width in the second branch (96 intermediate channels) and to $$ \frac{1}{12} $$ in the third branch (16 intermediate channels). The second Inception block increases the output to 480 channels (128 + 192 + 96 + 64) in a 4:6:3:2 ratio; its 256-channel input is reduced to $$ \frac{1}{2} $$ and $$ \frac{1}{8} $$, yielding 128 and 32 intermediate channels. Across Modules $$ b_3 $$, $$ b_4 $$, and $$ b_5 $$, the second branch (with the $$ 3 \times 3 $$ convolution) receives the largest share of output channels (matched by the first branch only in the final block of $$ b_5 $$), followed by the first branch ($$ 1 \times 1 $$); the third branch ($$ 5 \times 5 $$) and the fourth branch ($$ 3 \times 3 $$ max-pooling) receive the smallest shares.
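The ratios and reduction factors quoted for Module $$ b_3 $$ can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and the helper name `simplest_ratio` are assumptions made for this example, while the channel counts are the ones stated in the text.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Channel counts for the two Inception blocks of module b3, as quoted above.
# "out" lists the four branch outputs; "reduce" lists the 1x1 widths that
# precede the 3x3 and 5x5 convolutions. The layout is hypothetical.
blocks = [
    {"in": 192, "out": (64, 128, 32, 32), "reduce": (96, 16)},
    {"in": 256, "out": (128, 192, 96, 64), "reduce": (128, 32)},
]


def simplest_ratio(values):
    """Divide all entries by their greatest common divisor."""
    g = reduce(gcd, values)
    return tuple(v // g for v in values)


summary = []
for blk in blocks:
    total = sum(blk["out"])                      # concatenated output width
    ratio = simplest_ratio(blk["out"])           # branch ratio in lowest terms
    factors = tuple(Fraction(r, blk["in"]) for r in blk["reduce"])
    summary.append((total, ratio, factors))
    print(total, ratio, [str(f) for f in factors])
```

The first block gives 256 channels, the ratio 2:4:1:1, and the factors 1/2 and 1/12; the second gives 480 channels, 4:6:3:2, and 1/2 and 1/8.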

The GoogLeNet model is constructed from five sequential modules (labeled $$b_1$$ through $$b_5$$) followed by a fully connected output layer. The overall architecture diagram is shown in Fig. 8.4.2.

- Module $$b_1$$ (Stem): A $$7 \times 7$$ convolutional layer with $$64$$ output channels, stride $$2$$, and padding $$3$$, followed by ReLU activation and a $$3 \times 3$$ max-pooling layer (stride $$2$$, padding $$1$$). This module resembles the stems of AlexNet and LeNet.
- Module $$b_2$$: A $$1 \times 1$$ convolution with $$64$$ channels, then a $$3 \times 3$$ convolution that triples the channels to $$192$$, each followed by ReLU, concluding with $$3 \times 3$$ max-pooling (stride $$2$$, padding $$1$$).
- Module $$b_3$$: Two Inception blocks producing $$64+128+32+32=256$$ and $$128+192+96+64=480$$ output channels respectively, followed by $$3 \times 3$$ max-pooling.
- Module $$b_4$$: Five Inception blocks producing $$512$$, $$512$$, $$512$$, $$528$$, and $$832$$ output channels respectively, followed by $$3 \times 3$$ max-pooling.
- Module $$b_5$$: Two Inception blocks producing $$832$$ and $$1024$$ output channels respectively, followed by global average pooling (reducing each channel to $$1 \times 1$$) and a flatten operation.

Finally, a fully connected layer maps the $$1024$$-dimensional representation to the number of output classes.
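The per-block output widths listed for Modules $$b_3$$ to $$b_5$$ follow from summing the final layer of each branch. The sketch below reproduces them under stated assumptions: the names `CONFIGS` and `inception_out_channels` are hypothetical, and each tuple uses the argument order of the `Inception(...)` calls in this section's code, read as (branch 1 width, (branch 2 reduce, branch 2 output), (branch 3 reduce, branch 3 output), branch 4 width).

```python
# Branch configurations of the nine Inception blocks, grouped by module.
# The values are copied from the Inception(...) calls in this section;
# the dictionary itself is a hypothetical container for this example.
CONFIGS = {
    "b3": [(64, (96, 128), (16, 32), 32),
           (128, (128, 192), (32, 96), 64)],
    "b4": [(192, (96, 208), (16, 48), 64),
           (160, (112, 224), (24, 64), 64),
           (128, (128, 256), (24, 64), 64),
           (112, (144, 288), (32, 64), 64),
           (256, (160, 320), (32, 128), 128)],
    "b5": [(256, (160, 320), (32, 128), 128),
           (384, (192, 384), (48, 128), 128)],
}


def inception_out_channels(c1, c2, c3, c4):
    """Width after concatenation: only each branch's last layer counts."""
    return c1 + c2[1] + c3[1] + c4


widths = {
    name: [inception_out_channels(*cfg) for cfg in cfgs]
    for name, cfgs in CONFIGS.items()
}
for name, values in widths.items():
    print(name, values)
```

The intermediate "reduce" widths do not appear in the sums, because the $$1 \times 1$$ reductions only feed the convolutions that follow them inside the same branch.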

GoogLeNet Model Architecture

The fundamental convolutional block in the GoogLeNet architecture is the Inception block. It consists of four parallel branches that process the input to extract information at different spatial scales. The first branch uses a $$1 \times 1$$ convolutional layer. The second and third branches start with a $$1 \times 1$$ convolution to reduce the number of channels and model complexity, followed by $$3 \times 3$$ and $$5 \times 5$$ convolutions, respectively. The fourth branch applies a $$3 \times 3$$ max-pooling layer followed by a $$1 \times 1$$ convolutional layer to adjust channel counts. All branches use appropriate padding to ensure the spatial dimensions (height and width) of the input and output remain identical. Finally, the outputs from these four branches are concatenated along the channel dimension to form the block's output.
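To see why the four branch outputs can be concatenated, it helps to trace the shape arithmetic. The sketch below is not a framework implementation: it assumes stride 1 in every branch, padding 0, 1, 2, and 1 for the $$1 \times 1$$, $$3 \times 3$$, $$5 \times 5$$, and max-pooling windows (a common choice that preserves size; the text only says "appropriate padding"), and hypothetical function names.

```python
def conv_out_size(n, kernel, padding, stride=1):
    """Output height or width of a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def inception_output_shape(in_shape, c1, c2, c3, c4):
    """Return (channels, height, width) after an Inception block.

    c2 and c3 are (reduce_width, output_width) pairs; c1 and c4 are the
    output widths of the first and fourth branches.
    """
    _, h, w = in_shape
    # (kernel, padding) of the spatial layer in each branch. These padding
    # values are an assumption; the extra 1x1 layers leave sizes unchanged.
    branch_windows = [(1, 0), (3, 1), (5, 2), (3, 1)]
    sizes = {
        (conv_out_size(h, k, p), conv_out_size(w, k, p))
        for k, p in branch_windows
    }
    # Concatenation along channels requires identical spatial sizes.
    assert len(sizes) == 1, "branches disagree on spatial size"
    out_h, out_w = sizes.pop()
    return (c1 + c2[1] + c3[1] + c4, out_h, out_w)


shape = inception_output_shape((192, 12, 12), 64, (96, 128), (16, 32), 32)
print(shape)  # (256, 12, 12)
```

With these paddings each window of size $$k$$ uses padding $$(k-1)/2$$, so the output size $$n + 2p - k + 1$$ equals $$n$$ in every branch and only the channel count changes.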

Inception Block Structure

Dive into Deep Learning

GoogLeNet Channel Ratios in Inception Blocks

Passing a single-channel $$96 \times 96$$ image through GoogLeNet produces the following output shapes at each module:

1. Module $$b_1$$ (Stem): output $$1 \times 64 \times 24 \times 24$$
2. Module $$b_2$$: output $$1 \times 192 \times 12 \times 12$$
3. Module $$b_3$$ (2 Inception blocks): output $$1 \times 480 \times 6 \times 6$$
4. Module $$b_4$$ (5 Inception blocks): output $$1 \times 832 \times 3 \times 3$$
5. Module $$b_5$$ (2 Inception blocks + global avg pool): output $$1 \times 1024$$
6. Linear (output layer): output $$1 \times 10$$

The input height and width are reduced from $$224$$ to $$96$$ to enable a reasonable training time on Fashion-MNIST. The stem shrinks the spatial dimensions by a factor of four (a stride-$$2$$ convolution followed by stride-$$2$$ max-pooling), and the max-pooling layer at the end of each of Modules $$b_2$$, $$b_3$$, and $$b_4$$ halves them again ($$96 \to 24 \to 12 \to 6 \to 3$$), while the number of channels grows ($$64 \to 192 \to 480 \to 832 \to 1024$$). The global average pooling in Module $$b_5$$ then collapses the remaining $$3 \times 3$$ map to $$1 \times 1$$.
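The spatial sizes in this trace can be reproduced with the standard output-size formula $$\lfloor (n + 2p - k)/s \rfloor + 1$$. The sketch below is a plain-Python check, not framework code; the function names are hypothetical, and the kernel, stride, and padding values are those given for the stem and the max-pooling layers in this section.

```python
def out_size(n, kernel, stride, padding):
    """Output height or width of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def googlenet_spatial_trace(n=96):
    """Spatial size after each of modules b1..b5 for an n x n input."""
    trace = [n]
    # b1: 7x7 convolution (stride 2, padding 3), then 3x3 max-pool
    # (stride 2, padding 1). Together they divide the size by four.
    n = out_size(n, kernel=7, stride=2, padding=3)
    n = out_size(n, kernel=3, stride=2, padding=1)
    trace.append(n)
    # b2, b3, b4: the size-preserving layers are skipped here; only the
    # closing 3x3 max-pool (stride 2, padding 1) changes the size.
    for _ in ("b2", "b3", "b4"):
        n = out_size(n, kernel=3, stride=2, padding=1)
        trace.append(n)
    # b5: global average pooling reduces each channel to a single value.
    trace.append(1)
    return trace


print(googlenet_spatial_trace(96))   # [96, 24, 12, 6, 3, 1]
print(googlenet_spatial_trace(224))  # [224, 56, 28, 14, 7, 1]
```

The second call shows the same schedule at the original $$224 \times 224$$ resolution, where the map entering global average pooling is $$7 \times 7$$ rather than $$3 \times 3$$.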

GoogLeNet Layer-by-Layer Shape Trace

The GoogLeNet architecture relies on a large number of relatively arbitrary hyperparameters and marks the beginning of deliberate, block-level experimentation in network design. These hyperparameters include the number of output channels per branch in each Inception block, the number of blocks before each dimensionality-reduction step, and the relative partitioning of capacity across channels. Much of this complexity stems from the fact that automated tools for architecture exploration were not yet available: configurations had to be specified by hand by the experimenter, and early exploration relied on costly brute-force search and genetic algorithms rather than modern automated architecture search methods. Although the tuning was largely manual, this structured, block-level approach is why GoogLeNet is arguably the first truly modern Convolutional Neural Network (CNN).

GoogLeNet Hyperparameter Complexity

The full GoogLeNet model is assembled by sequentially connecting its five constituent modules ($$b_1$$ through $$b_5$$) and appending a fully connected output layer. Implementations for several frameworks follow; each assumes that an `Inception` block class taking the four branch configurations as arguments is already defined.

**PyTorch Implementation:**
```python
from torch import nn
from d2l import torch as d2l

class GoogleNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

@d2l.add_to_class(GoogleNet)
def b2(self):
    return nn.Sequential(
        nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

@d2l.add_to_class(GoogleNet)
def b3(self):
    return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                         Inception(128, (128, 192), (32, 96), 64),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

@d2l.add_to_class(GoogleNet)
def b4(self):
    return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                         Inception(160, (112, 224), (24, 64), 64),
                         Inception(128, (128, 256), (24, 64), 64),
                         Inception(112, (144, 288), (32, 64), 64),
                         Inception(256, (160, 320), (32, 128), 128),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

@d2l.add_to_class(GoogleNet)
def b5(self):
    return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                         Inception(384, (192, 384), (48, 128), 128),
                         nn.AdaptiveAvgPool2d((1,1)), nn.Flatten())

@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                             self.b5(), nn.LazyLinear(num_classes))
    self.net.apply(d2l.init_cnn)
```

**MXNet Implementation:**
```python
from mxnet import init
from mxnet.gluon import nn
from d2l import mxnet as d2l

class GoogleNet(d2l.Classifier):
    def b1(self):
        net = nn.Sequential()
        net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3,
                          activation='relu'),
                nn.MaxPool2D(pool_size=3, strides=2, padding=1))
        return net

@d2l.add_to_class(GoogleNet)
def b2(self):
    net = nn.Sequential()
    net.add(nn.Conv2D(64, kernel_size=1, activation='relu'),
           nn.Conv2D(192, kernel_size=3, padding=1, activation='relu'),
           nn.MaxPool2D(pool_size=3, strides=2, padding=1))
    return net

@d2l.add_to_class(GoogleNet)
def b3(self):
    net = nn.Sequential()
    net.add(Inception(64, (96, 128), (16, 32), 32),
           Inception(128, (128, 192), (32, 96), 64),
           nn.MaxPool2D(pool_size=3, strides=2, padding=1))
    return net

@d2l.add_to_class(GoogleNet)
def b4(self):
    net = nn.Sequential()
    net.add(Inception(192, (96, 208), (16, 48), 64),
            Inception(160, (112, 224), (24, 64), 64),
            Inception(128, (128, 256), (24, 64), 64),
            Inception(112, (144, 288), (32, 64), 64),
            Inception(256, (160, 320), (32, 128), 128),
            nn.MaxPool2D(pool_size=3, strides=2, padding=1))
    return net

@d2l.add_to_class(GoogleNet)
def b5(self):
    net = nn.Sequential()
    net.add(Inception(256, (160, 320), (32, 128), 128),
            Inception(384, (192, 384), (48, 128), 128),
            nn.GlobalAvgPool2D())
    return net

@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential()
    self.net.add(self.b1(), self.b2(), self.b3(), self.b4(), self.b5(),
                 nn.Dense(num_classes))
    self.net.initialize(init.Xavier())
```

**JAX Implementation:**
```python
from d2l import jax as d2l

class GoogleNet(d2l.Classifier):
    lr: float = 0.1
    num_classes: int = 10

    # The remaining method definitions are truncated in the source.
```

GoogLeNet Model Code Implementation

The ResNeXt architecture addresses the trade-off between nonlinearity and dimensionality in standard ResNet designs. Instead of increasing network depth or widening convolutions, ResNeXt increases the number of channels that carry information between blocks while avoiding the quadratic computational penalty this would normally incur (the cost of a convolution mapping $$c_i$$ input channels to $$c_o$$ output channels grows as $$c_i \cdot c_o$$). Inspired by the Inception block's strategy of separating information flow into independent groups, ResNeXt applies the same transformation in all of its parallel branches. This uniform multi-branch design minimizes the need for manual hyperparameter tuning of each branch.
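The effect of splitting channels into independent groups can be illustrated by counting convolution weights. The sketch below is a rough cost model under stated assumptions: the function name `conv_weight_count`, the $$3 \times 3$$ kernel, and the channel and group counts are all hypothetical choices for illustration, and biases are ignored.

```python
def conv_weight_count(c_in, c_out, kernel=3, groups=1):
    """Weights in a kernel x kernel convolution split into equal groups.

    Each group maps c_in / groups input channels to c_out / groups output
    channels and does not see the channels of any other group.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * kernel * kernel


dense = conv_weight_count(128, 128)                # one group, 128 channels
doubled = conv_weight_count(256, 256)              # one group, 256 channels
grouped = conv_weight_count(256, 256, groups=4)    # four groups, 256 channels

print(dense, doubled, grouped)  # 147456 589824 147456
```

Doubling the channel count with a single group quadruples the weight count, whereas splitting the wider layer into four groups brings it back to the cost of the original narrower layer. In general the count scales as $$c_i \cdot c_o / g$$ for $$g$$ groups.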
