The output tensor containing all generated anchor boxes for an image typically has an initial shape of $$(\text{batch size}, \text{total anchor boxes}, 4)$$. To make it easy to access the anchor boxes centered on a specific pixel, this tensor can be reshaped (assuming a batch size of 1) to $$(\text{image height}, \text{image width}, \text{anchor boxes per pixel}, 4)$$. Once reshaped, the coordinates of any individual anchor box can be retrieved directly by indexing into the tensor with its $$(y, x)$$ spatial location and its index among the anchor boxes centered on that pixel.
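The reshape-then-index pattern can be sketched in NumPy as follows. This is a minimal illustration, assuming a batch size of 1 and that the anchors are stored pixel by pixel in row-major $$(y, x)$$ order; the array contents are hypothetical placeholder numbers rather than real box coordinates.

```python
import numpy as np

height, width, boxes_per_pixel = 3, 4, 5
total = height * width * boxes_per_pixel

# Hypothetical stand-in for a generator's output: shape (1, total anchors, 4),
# with anchors laid out pixel by pixel in row-major (y, x) order.
anchors = np.arange(total * 4, dtype=np.float32).reshape(1, total, 4)

# Drop the batch dimension (valid only because batch size is 1) and expose
# the per-pixel structure: (image height, image width, boxes per pixel, 4).
boxes = anchors.reshape(height, width, boxes_per_pixel, 4)

# Coordinates of anchor box 2 among those centered on pixel (y=1, x=3).
y, x, k = 1, 3, 2
box = boxes[y, x, k, :]
print(box)
```

Indexing `boxes[y, x, k]` is equivalent to reading row `(y * width + x) * boxes_per_pixel + k` of the flat tensor, which is exactly the bookkeeping the reshape saves you from doing by hand.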

Indexing Anchor Box Coordinates

To programmatically generate multiple anchor boxes, a function can be defined that takes an input image tensor along with lists of desired scales and aspect ratios. The algorithm constructs a grid of center points, offsetting each pixel index by $$0.5$$ so that it lands at the center of its pixel, and then multiplies these points by the reciprocals of the image's height and width so that the centers are expressed in normalized coordinates between $$0$$ and $$1$$. It then computes the widths and heights of the anchor boxes using a practical strategy that pairs each scale with the first aspect ratio, and the first scale with each aspect ratio. Finally, it combines the center coordinates with the computed dimensions and returns a single output tensor containing the bounding box coordinates of all anchor boxes across the entire image.
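A minimal NumPy sketch of such a generator is shown below. The name `multibox_prior`, the `(xmin, ymin, xmax, ymax)` corner layout of each box, and the `height / width` correction on box widths are assumptions made for illustration, not necessarily the original implementation. The correction keeps the pixel-space width-to-height ratio of each box equal to its aspect ratio $$r$$ on non-square images.

```python
import numpy as np

def multibox_prior(data, sizes, ratios):
    """Hypothetical sketch: anchors centered on every pixel of `data`.

    `data` has shape (..., height, width). Returns shape
    (1, height * width * (n + m - 1), 4) with normalized corner coordinates
    (xmin, ymin, xmax, ymax).
    """
    height, width = data.shape[-2:]
    sizes = np.asarray(sizes, dtype=np.float64)
    ratios = np.asarray(ratios, dtype=np.float64)
    boxes_per_pixel = len(sizes) + len(ratios) - 1

    # Offset 0.5 moves each index to its pixel center; multiplying by the
    # reciprocal of height/width normalizes centers into (0, 1).
    center_h = (np.arange(height) + 0.5) * (1.0 / height)
    center_w = (np.arange(width) + 0.5) * (1.0 / width)
    shift_y, shift_x = np.meshgrid(center_h, center_w, indexing="ij")
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Box shapes: (s_i, r_1) for every scale, then (s_1, r_j) for j >= 2.
    # Assumption: widths are multiplied by height / width so that the
    # pixel-space width:height ratio equals r on non-square images.
    w = np.concatenate((sizes * np.sqrt(ratios[0]),
                        sizes[0] * np.sqrt(ratios[1:]))) * height / width
    h = np.concatenate((sizes / np.sqrt(ratios[0]),
                        sizes[0] / np.sqrt(ratios[1:])))

    # Half-extent offsets per shape, tiled once per pixel: (hw * k, 4).
    offsets = np.tile(np.stack((-w, -h, w, h)).T, (height * width, 1)) / 2

    # Each pixel center repeated once per shape, so row (pixel * k + j)
    # pairs pixel `pixel` with shape j.
    centers = np.repeat(np.stack((shift_x, shift_y, shift_x, shift_y), axis=1),
                        boxes_per_pixel, axis=0)
    return (centers + offsets)[np.newaxis, :, :]

Y = multibox_prior(np.zeros((1, 3, 4, 4)), sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(Y.shape)
```

With 3 scales and 3 aspect ratios on a 4×4 input, each pixel receives $$3 + 3 - 1 = 5$$ anchor boxes, so the output holds $$4 \times 4 \times 5 = 80$$ boxes.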


For an input image of width $$w$$ and height $$h$$, suppose anchor boxes are generated with $$n$$ scales $$s_1, \ldots, s_n$$ and $$m$$ aspect ratios $$r_1, \ldots, r_m$$. Using every possible $$(s_i, r_j)$$ combination at each pixel would produce $$whnm$$ anchor boxes in total, which is computationally prohibitive. In practice, only the pairings that include either the first scale $$s_1$$ or the first aspect ratio $$r_1$$ are retained:

$$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1)$$
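The retained pairs can be enumerated directly. The helper below is a hypothetical illustration (the name `anchor_pairs` is not from the text); it lists the pairs in the same order as the sequence above: $$s_1$$ with every ratio, then each remaining scale with $$r_1$$.

```python
def anchor_pairs(scales, ratios):
    """Hypothetical helper: the (scale, ratio) pairs kept at each pixel."""
    # (s_1, r_1), (s_1, r_2), ..., (s_1, r_m)
    pairs = [(scales[0], r) for r in ratios]
    # (s_2, r_1), ..., (s_n, r_1); (s_1, r_1) is not repeated
    pairs += [(s, ratios[0]) for s in scales[1:]]
    return pairs

pairs = anchor_pairs([0.75, 0.5, 0.25], [1, 2, 0.5])
print(pairs, len(pairs))
```

Because $$(s_1, r_1)$$ appears only once, the list has $$m + (n - 1)$$ entries.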

This yields $$n + m - 1$$ distinct anchor boxes per pixel and $$wh(n + m - 1)$$ anchor boxes for the entire image, reducing the per-pixel count from $$nm$$ to $$n + m - 1$$ while still providing a variety of box sizes and shapes for matching ground-truth objects.
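As an illustrative calculation (the image size is a hypothetical example, not taken from the text above), take an image of height $$h = 561$$ and width $$w = 728$$ with $$n = 3$$ scales and $$m = 3$$ aspect ratios, so that $$wh = 408{,}408$$ pixels:

$$wh(n + m - 1) = 408{,}408 \times 5 = 2{,}042{,}040 \quad \text{versus} \quad whnm = 408{,}408 \times 9 = 3{,}675{,}672$$

Each pixel then carries 5 anchor boxes instead of 9.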
