To restore the spatial dimensions of the feature maps to the size of the original input image, a Fully Convolutional Network (FCN) uses a transposed convolutional layer. To enlarge the height and width by a factor of $$ s $$, the transposed convolution is configured with a stride of $$ s $$, a padding of $$ s/2 $$ (assuming $$ s/2 $$ is an integer), and a kernel height and width of $$ 2s $$. With these settings, an input of size $$ n $$ along one axis produces an output of size $$ (n - 1)s - 2(s/2) + 2s = ns $$, which is exactly $$ s $$ times larger. For instance, to upscale a feature map by a factor of 32, the stride is 32, the padding is 16, and the kernel size is 64.

```python
# PyTorch
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, padding=16, stride=32))
```

```python
# MXNet
net.add(nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16, strides=32))
```
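The arithmetic behind these settings can be checked with the standard output-size formula for a transposed convolution, $$ (n - 1)s - 2p + k $$, assuming no output padding and a dilation of 1. The helper name `transposed_conv_out` is hypothetical and used only for illustration; the 10 x 15 input is the backbone output for a 320 x 480 image.

```python
def transposed_conv_out(n, kernel_size, padding, stride):
    """Output length of a transposed convolution along one axis.

    Assumes no output padding and a dilation of 1.
    """
    return (n - 1) * stride - 2 * padding + kernel_size


s = 32  # upsampling factor
# A 10 x 15 feature map upsampled with stride s, padding s/2, kernel 2s
height = transposed_conv_out(10, kernel_size=2 * s, padding=s // 2, stride=s)
width = transposed_conv_out(15, kernel_size=2 * s, padding=s // 2, stride=s)
print(height, width)  # 320 480
```

Both axes come back to the original 320 x 480 size, which is why the kernel and padding are tied to the stride in this way.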

Transposed Convolution Configuration for FCN Upsampling

In a Fully Convolutional Network (FCN), after image features have been extracted by the backbone network, a $$ 1 \times 1 $$ convolutional layer is applied. The purpose of this layer is to transform the number of output channels from the feature extractor to match the exact number of target classes (e.g., 21 classes for the Pascal VOC2012 dataset) without altering the spatial dimensions of the feature maps.

```python
# PyTorch
num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
```

```python
# MXNet
num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1))
```

Claude

Google

If the input has multiple channels, a 1 x 1 convolution filter combines the values at the same spatial position across all input channels into a single output value.

If we convolve an *n x n x m* input with *f* filters of size 1 x 1, we get an *n x n x f* output, where each cell is a linear combination (weighted sum) of the corresponding cells across the *m* input channels.

When *f* is smaller than *m*, this reduces the number of channels, which saves computation and memory.
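As a minimal sketch of this per-cell linear combination, the plain-Python function below applies *f* filters of size 1 x 1 to a channels-last input. The function name `conv1x1` and the toy values are illustrative assumptions, and the bias term is left out.

```python
def conv1x1(x, weights):
    """Apply f filters of size 1 x 1 to a channels-last input.

    x: nested list of shape (n, n, m)
    weights: nested list of shape (f, m), one weight vector per filter
    Returns a nested list of shape (n, n, f). Bias is omitted for brevity.
    """
    return [
        [
            [sum(w_c * v for w_c, v in zip(w, cell)) for w in weights]
            for cell in row
        ]
        for row in x
    ]


# 2 x 2 input with m = 3 channels
x = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
# f = 2 filters: the first sums all channels, the second keeps only channel 0
weights = [[1, 1, 1], [1, 0, 0]]

y = conv1x1(x, weights)
print(y)  # [[[6, 1], [15, 4]], [[24, 7], [33, 10]]]
```

The spatial size stays 2 x 2 while the channel count drops from 3 to 2, since each output cell depends only on the input cell at the same position.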

1 x 1 Convolution Layer in Neural Networks (Network in Network)

When utilizing a pretrained convolutional neural network, such as ResNet-18, for feature extraction in a Fully Convolutional Network (FCN), the final classification layers must be removed. Specifically, the global average pooling layer and the fully connected layer are discarded because they collapse spatial dimensions and are unnecessary for dense pixel-level predictions. The remaining layers form the feature extraction backbone of the FCN, which produces feature maps with reduced spatial dimensions. For example, given an input with a height of 320 and width of 480, the forward propagation reduces the spatial dimensions to 1/32 of the original, resulting in an output shape of $$ 10 \times 15 $$. This extraction process can be implemented in frameworks like PyTorch or MXNet by explicitly selecting all layers except the final pooling and dense layers.

```python
# PyTorch
net = nn.Sequential(*list(pretrained_net.children())[:-2])
```

```python
# MXNet
net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
    net.add(layer)
```
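The 1/32 reduction can be traced with the usual convolution output-size formula. The sketch below assumes the standard ResNet-18 layout, in which five layers on the main path use a stride of 2; the helper name `conv_out` is hypothetical.

```python
def conv_out(n, kernel_size, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel_size) // stride + 1


# (kernel_size, stride, padding) of each stride-2 layer on the main path,
# assuming the standard ResNet-18 layout
downsampling_layers = [
    (7, 2, 3),  # initial 7 x 7 convolution
    (3, 2, 1),  # 3 x 3 max pooling
    (3, 2, 1),  # first convolution of the second residual stage
    (3, 2, 1),  # first convolution of the third residual stage
    (3, 2, 1),  # first convolution of the fourth residual stage
]

height, width = 320, 480
for k, s, p in downsampling_layers:
    height, width = conv_out(height, k, s, p), conv_out(width, k, s, p)

print(height, width)  # 10 15
```

Five halvings give the overall factor of 2^5 = 32.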

Feature Extraction in Fully Convolutional Networks

Dive into Deep Learning

The bottleneck layer reduces training time by cutting the number of features and operations. Giving a layer fewer nodes (channels) than the layers before it reduces the dimensionality of the data passed on to the layers that follow.

As shown in the figure, the architecture without a bottleneck (bottom) requires about 120 M computations, whereas adding a bottleneck layer in the middle (top) reduces this to about 12.4 M.
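The two totals can be reproduced with a short count of multiplications. The layer sizes below are an assumption, since they are not stated here: a 28 x 28 x 192 input, a 5 x 5 convolution with 32 filters and "same" padding, and a 1 x 1 bottleneck with 16 filters. These sizes happen to yield the quoted figures; the helper name `conv_multiplications` is hypothetical.

```python
def conv_multiplications(out_h, out_w, in_channels, out_channels, kernel_size):
    """Multiplications for a convolution that produces an out_h x out_w map."""
    return out_h * out_w * out_channels * kernel_size * kernel_size * in_channels


# Assumed layer sizes (not stated in the text above)
h = w = 28
in_channels, mid_channels, out_channels = 192, 16, 32

# 5 x 5 convolution applied directly to the 192-channel input
direct = conv_multiplications(h, w, in_channels, out_channels, kernel_size=5)

# 1 x 1 bottleneck down to 16 channels, then the 5 x 5 convolution
bottleneck = (
    conv_multiplications(h, w, in_channels, mid_channels, kernel_size=1)
    + conv_multiplications(h, w, mid_channels, out_channels, kernel_size=5)
)

print(f"{direct / 1e6:.1f} M")      # 120.4 M
print(f"{bottleneck / 1e6:.1f} M")  # 12.4 M
```

Under these assumptions, most of the saving comes from the 5 x 5 convolution seeing 16 input channels instead of 192.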
