When building a dataset pipeline for semantic segmentation that uses random fixed-shape cropping, some input images may have a height or width smaller than the output crop size. Such images cannot yield a full-size crop, so to prevent out-of-bounds errors during cropping they must be filtered out and excluded from the training pipeline.
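The filtering rule can be sketched in a few lines of plain Python. This is only an illustration: the helper name `filter_by_crop_size` and the example shapes are hypothetical, and a real pipeline would read the shapes from the loaded image tensors.

```python
def filter_by_crop_size(shapes, crop_size):
    """Return the indices of images that are at least as large as the crop
    in both height and width."""
    crop_h, crop_w = crop_size
    return [i for i, (h, w) in enumerate(shapes)
            if h >= crop_h and w >= crop_w]


# Hypothetical (height, width) pairs for four images.
shapes = [(375, 500), (281, 500), (500, 334), (320, 480)]
kept = filter_by_crop_size(shapes, crop_size=(320, 480))
print(kept)  # [0, 3]
```

An image that is large enough in only one dimension (such as the 500 x 334 example) is still dropped, because the crop must fit in both.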

Image Filtering for Random Cropping in Semantic Segmentation

When training a Fully Convolutional Network (FCN) for semantic segmentation, input images are randomly cropped to a fixed shape so that exact pixel correspondence with the labels is maintained. For the spatial dimensions of the network's final output to match the input crop after a series of downsampling and upsampling operations, the crop height and width must both be exactly divisible by the network's total downsampling factor. For example, if the feature extraction backbone reduces the spatial dimensions by a factor of $$32$$, the height and width of the randomly cropped inputs (such as $$320 \times 480$$) must both be multiples of $$32$$. This prevents spatial mismatch errors when the transposed convolutional layer subsequently upsamples the feature maps back to the original crop size.

```python
# PyTorch
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
```

```python
# MXNet
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
```
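The divisibility requirement on `crop_size` can be checked before training. The following is a minimal sketch: the helper names are hypothetical, and the default factor of 32 is taken from the example in the text rather than from any particular model definition.

```python
def is_valid_crop(crop_size, factor=32):
    """Return True if every crop dimension is a multiple of the total
    downsampling factor."""
    return all(dim % factor == 0 for dim in crop_size)


def feature_map_shape(crop_size, factor=32):
    """Spatial shape of the downsampled feature map for a valid crop."""
    if not is_valid_crop(crop_size, factor):
        raise ValueError(f"{crop_size} is not divisible by {factor}")
    return tuple(dim // factor for dim in crop_size)


print(is_valid_crop((320, 480)))      # True
print(feature_map_shape((320, 480)))  # (10, 15)
print(is_valid_crop((330, 480)))      # False
```

Upsampling the 10 x 15 feature map by the same factor of 32 recovers 320 x 480, which is why the crop shape in the text is a safe choice.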

Input Cropping Divisibility in Fully Convolutional Networks

In semantic segmentation tasks, the input image and its corresponding ground-truth label maintain a strict one-to-one spatial correspondence at the pixel level. Because of this precise alignment, simply rescaling images to fit a model's required input shape is problematic; it requires inversely rescaling the predicted pixel classes back to the original dimensions during inference, which introduces inaccuracies along the boundaries of different semantic regions. To avoid these artifacts and preserve exact pixel correspondence, the input images and their labels are typically subjected to random fixed-shape cropping rather than rescaling.
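The key point is that the image and its label must be cropped with the same window. A minimal sketch on nested Python lists is shown here; the function name `random_crop_pair` is hypothetical, and a real pipeline would apply the same idea to image tensors.

```python
import random


def random_crop_pair(image, label, crop_size, rng=random):
    """Crop the same randomly placed window from an image and its label.

    `image` and `label` are 2D grids (lists of rows) of equal shape.
    """
    crop_h, crop_w = crop_size
    height, width = len(image), len(image[0])
    if height < crop_h or width < crop_w:
        raise ValueError("image is smaller than the crop size")
    # randint is inclusive at both ends, so the window always fits.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)

    def crop(grid):
        return [row[left:left + crop_w] for row in grid[top:top + crop_h]]

    return crop(image), crop(label)


# Toy data: each image "pixel" stores its own (row, col) position and the
# label stores a value derived from that position.
image = [[(r, c) for c in range(6)] for r in range(4)]
label = [[r * 6 + c for c in range(6)] for r in range(4)]
image_crop, label_crop = random_crop_pair(image, label, (2, 3))
```

Because both crops share `top` and `left`, every label value in `label_crop` still describes the pixel at the same position in `image_crop`.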

The Pascal VOC2012 dataset is one of the most important and widely used benchmarks for semantic segmentation in computer vision. It pairs each image with a pixel-level annotation. The dataset is organized into the following directories: ImageSets/Segmentation contains the text files that define the training and validation splits, JPEGImages stores the original input photographs, and SegmentationClass holds the corresponding label images.
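The directory layout translates directly into file paths. The sketch below assumes the file names of the standard VOC2012 release (`train.txt` and `val.txt` as split files, `.jpg` inputs, `.png` labels). The function name `voc_segmentation_paths` is hypothetical and `voc_dir` stands for wherever the dataset was extracted.

```python
import os


def voc_segmentation_paths(voc_dir, is_train=True):
    """Return (image_path, label_path) pairs for one split of the dataset."""
    split_file = 'train.txt' if is_train else 'val.txt'
    list_path = os.path.join(voc_dir, 'ImageSets', 'Segmentation', split_file)
    with open(list_path) as f:
        names = f.read().split()  # one example name per line
    return [
        (os.path.join(voc_dir, 'JPEGImages', name + '.jpg'),
         os.path.join(voc_dir, 'SegmentationClass', name + '.png'))
        for name in names
    ]
```

The function only builds paths. Decoding the images is left to whichever image library the pipeline already uses.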

Pascal VOC2012 Dataset

Dive into Deep Learning

In the Pascal VOC2012 dataset, semantic segmentation labels are provided as images with the same height and width as the original input images. Each pixel in a label image encodes its semantic class as an RGB color, so pixels of the same color belong to the same category. By convention in this dataset, black pixels designate the background, white pixels indicate borders between objects, and the other distinct colors correspond to the predefined target classes (such as aeroplane, bicycle, or bird).
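Training usually needs class indices rather than colors, so the label image is converted with a lookup table. The sketch below lists only the first four entries of the dataset's colormap (background, aeroplane, bicycle, bird). Mapping any unlisted color, such as border pixels, to an ignore value of 255 is an assumption made for this illustration, and `label_to_indices` is a hypothetical helper name.

```python
# First four entries of the VOC colormap; the full table has 21 classes.
VOC_COLORMAP = [(0, 0, 0), (128, 0, 0), (0, 128, 0), (128, 128, 0)]
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird']

COLOR_TO_INDEX = {color: i for i, color in enumerate(VOC_COLORMAP)}


def label_to_indices(label_rgb, ignore_index=255):
    """Convert a grid of RGB label pixels into a grid of class indices.

    Colors missing from the table map to `ignore_index` (an assumption
    of this sketch).
    """
    return [[COLOR_TO_INDEX.get(tuple(pixel), ignore_index) for pixel in row]
            for row in label_rgb]


# A tiny 2x2 label image: background, aeroplane, bird, and a white pixel.
label_rgb = [[(0, 0, 0), (128, 0, 0)],
             [(128, 128, 0), (255, 255, 255)]]
indices = label_to_indices(label_rgb)
print(indices)  # [[0, 1], [3, 255]]
```

The output grid has the same height and width as the label image, so it stays aligned with the input pixels.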
