Concept

Input Cropping Divisibility in Fully Convolutional Networks

In a fully convolutional network (FCN) used for semantic segmentation, input images are randomly cropped to a fixed shape, which preserves the exact pixel correspondence between each image and its label. The network downsamples the crop and then upsamples it again, so the final output matches the input crop only if the crop height and width are both exactly divisible by the network's total downsampling factor. For example, if the feature extraction backbone reduces the spatial dimensions by a factor of 32, the height and width of the randomly cropped inputs (such as 320 × 480) must both be exactly divisible by 32. This prevents a spatial mismatch when the transposed convolutional layer upsamples the feature maps back to the original crop size.
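
The arithmetic behind this rule can be checked without running a network. The sketch below is illustrative only and rests on two assumptions that the text above does not state: the backbone halves each spatial dimension five times with ceiling rounding (as stride-2 layers with "same"-style padding do), and the upsampling layer is a transposed convolution with kernel size 64, padding 16 and stride 32. The function names are hypothetical.

```python
import math


def downsampled_size(size: int, num_halvings: int = 5) -> int:
    """Spatial size after repeated stride-2 stages (assumed ceil rounding)."""
    for _ in range(num_halvings):
        size = math.ceil(size / 2)
    return size


def transposed_conv_size(size: int, kernel: int = 64,
                         padding: int = 16, stride: int = 32) -> int:
    """Output size of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def round_trip_size(size: int) -> int:
    """Size after downsampling by 32 and upsampling by 32 again."""
    return transposed_conv_size(downsampled_size(size))


for side in (320, 480, 300):
    # A side is recovered exactly only when it is a multiple of 32.
    print(side, "->", round_trip_size(side), "| divisible:", side % 32 == 0)
```

Under these assumptions, a side of 320 or 480 returns to its original size, while a side of 300 comes back as 320 and no longer lines up with the label map.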

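# PyTorch
from d2l import torch as d2l

batch_size, crop_size = 32, (320, 480)  # 320 and 480 are both multiples of 32
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)

# MXNet
from d2l import mxnet as d2l

batch_size, crop_size = 32, (320, 480)  # 320 and 480 are both multiples of 32
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)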
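
When a different crop size is wanted, one way to stay within the rule is to round each side down to the nearest multiple of the downsampling factor before passing it to the data loader. This is a minimal sketch, not part of the d2l library; the helper name `largest_valid_crop` and the 375 × 500 example size are assumptions made for illustration.

```python
def largest_valid_crop(height: int, width: int, factor: int = 32) -> tuple:
    """Round each side down to the nearest multiple of `factor`."""
    if height < factor or width < factor:
        raise ValueError("each side must be at least as large as the factor")
    return (height // factor) * factor, (width // factor) * factor


# Hypothetical target size of 375 x 500 pixels.
crop_size = largest_valid_crop(375, 500)
print(crop_size)
assert all(side % 32 == 0 for side in crop_size)
```

Rounding down rather than up keeps the crop within the bounds of the requested size.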

Updated 2026-05-21

Tags

D2L

Dive into Deep Learning @ D2L