In deep learning frameworks, multiple bounding boxes for a single image are programmatically represented as a two-dimensional tensor of shape $$(n, 4)$$, where $$n$$ is the total number of bounding boxes. When processing a minibatch of images, this representation expands to a three-dimensional tensor of shape $$(b, n, 4)$$, where $$b$$ represents the batch size. In both configurations, the final dimension of size 4 holds the numerical parameters defining each box, such as the two corner coordinates or the center position paired with width and height.
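The shapes described above can be checked with a small sketch. It uses NumPy rather than any particular deep learning framework, and the box coordinates are made-up illustrative values in an assumed $$(x_1, y_1, x_2, y_2)$$ corner layout. It also assumes every image in the minibatch has the same number of boxes, which stacking requires.

```python
import numpy as np

# Two hypothetical boxes for one image, each stored as (x1, y1, x2, y2).
boxes = np.array([[60.0, 45.0, 378.0, 516.0],
                  [400.0, 112.0, 655.0, 493.0]])
print(boxes.shape)  # (2, 4): n = 2 boxes, 4 parameters per box

# Stack per-image arrays along a new leading axis to form a minibatch.
# The same image is repeated three times purely for illustration.
batch = np.stack([boxes, boxes, boxes])
print(batch.shape)  # (3, 2, 4): b = 3 images, n = 2 boxes each

# The last axis always indexes the four box parameters.
x1 = batch[..., 0]
print(x1.shape)  # (3, 2)
```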


In object detection, a bounding box describes the spatial location of a target object within an image. Unlike image classification, object detection must also localize each object, which is usually done with a rectangular boundary that tightly encloses the object of interest. Box coordinates are expressed in an image coordinate system whose origin $$(0, 0)$$ is the upper-left corner, with the positive $$x$$-axis pointing right and the positive $$y$$-axis pointing down.

Bounding Box in Object Detection

Dive into Deep Learning

In object detection, a bounding box is often represented by the coordinates of its upper-left and lower-right corners. Specifically, the box is determined by the $$x$$ and $$y$$ coordinates of the upper-left corner and the corresponding $$x$$ and $$y$$ coordinates of the lower-right corner within the image.

Two-Corner Bounding Box Representation

In object detection, an alternative bounding box representation uses the $$(x, y)$$-axis coordinates of the bounding box's center point, along with the overall width and height of the box.

Center-Width-Height Bounding Box Representation
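The two-corner and center-width-height representations carry the same information, so one can be converted into the other. The following is a minimal NumPy sketch of both directions; the function names and the $$(x_1, y_1, x_2, y_2)$$ / $$(c_x, c_y, w, h)$$ orderings are assumptions chosen for illustration.

```python
import numpy as np

def box_corner_to_center(boxes):
    """Convert (x1, y1, x2, y2) boxes to (cx, cy, w, h); input shape (..., 4)."""
    x1, y1, x2, y2 = (boxes[..., i] for i in range(4))
    cx = (x1 + x2) / 2  # center is the midpoint of the two corners
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    return np.stack([cx, cy, w, h], axis=-1)

def box_center_to_corner(boxes):
    """Convert (cx, cy, w, h) boxes to (x1, y1, x2, y2); input shape (..., 4)."""
    cx, cy, w, h = (boxes[..., i] for i in range(4))
    x1 = cx - w / 2
    y1 = cy - h / 2
    x2 = cx + w / 2
    y2 = cy + h / 2
    return np.stack([x1, y1, x2, y2], axis=-1)

corners = np.array([[60.0, 45.0, 378.0, 516.0]])  # illustrative values
centers = box_corner_to_center(corners)
print(centers)  # [[219.  280.5 318.  471. ]]
```

Converting back with `box_center_to_corner(centers)` recovers the original corner coordinates, which is a convenient sanity check for either function.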

To check visually whether a bounding box is accurate, it can be drawn directly onto the image with a plotting library such as `matplotlib`. Plotting libraries often expect a specific input format, for example the $$(x, y)$$ coordinates of the upper-left corner together with the width and height of the rectangle, so a small helper function is typically used to convert the standard bounding box representation into the format needed to draw a rectangular patch. The drawn boxes can then be compared against the outlines of the target objects.

Visualizing Bounding Boxes
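A helper of the kind described above might look like the following sketch. The name `bbox_to_rect` and the styling arguments are assumptions for illustration; the input is assumed to be a two-corner box $$(x_1, y_1, x_2, y_2)$$, and `matplotlib.patches.Rectangle` takes the anchor corner followed by width and height.

```python
from matplotlib.patches import Rectangle

def bbox_to_rect(bbox, color):
    """Turn an (x1, y1, x2, y2) box into an unfilled matplotlib Rectangle."""
    x1, y1, x2, y2 = bbox
    # Rectangle expects the anchor corner (x1, y1), then width, then height.
    return Rectangle((x1, y1), x2 - x1, y2 - y1,
                     fill=False, edgecolor=color, linewidth=2)

rect = bbox_to_rect((60.0, 45.0, 378.0, 516.0), "blue")  # illustrative box
```

Given an `Axes` object `ax` on which the image has already been shown with `ax.imshow(img)`, calling `ax.add_patch(rect)` overlays the box outline. Because `imshow` places the origin at the upper-left corner by default, the anchor corner passed to `Rectangle` corresponds to the upper-left corner of the box.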

Bounding Box Tensor Shape

An anchor box (also called a prior box) is a predefined bounding box that serves as a candidate region for detecting objects. Anchor boxes with several scales and aspect ratios are generated around each pixel of the input image. Rather than searching for objects at arbitrary positions and sizes, an object detection model starts from this fixed set of boxes and predicts how each one should be adjusted to fit a nearby object tightly. Each anchor box is fully determined once its center position, scale, and aspect ratio are specified. Anchor boxes thus provide the initial spatial hypotheses that a detector refines, covering likely ground-truth bounding boxes without an exhaustive search.

Anchor Box in Object Detection
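To make "center position, scale, and aspect ratio" concrete, here is a minimal sketch that builds the anchor boxes for a single center. The text does not fix a formula, so the sketch assumes one common convention: for an image of width $$w$$ and height $$h$$, scale $$s$$ and ratio $$r$$ give an anchor of width $$w s \sqrt{r}$$ and height $$h s / \sqrt{r}$$. The function name and the example values are hypothetical.

```python
import math

def anchor_box(cx, cy, scale, ratio, img_w, img_h):
    """Return (x1, y1, x2, y2) in pixels for one anchor centered at (cx, cy).

    Assumed convention (not fixed by the text above):
    width = img_w * scale * sqrt(ratio), height = img_h * scale / sqrt(ratio).
    """
    w = img_w * scale * math.sqrt(ratio)
    h = img_h * scale / math.sqrt(ratio)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Several (scale, ratio) pairs share the same center and yield different shapes.
shapes = [(0.5, 1.0), (0.25, 1.0), (0.5, 4.0)]
anchors = [anchor_box(100, 100, s, r, img_w=200, img_h=200) for s, r in shapes]
for box in anchors:
    print(box)
```

Repeating this for every pixel center produces the full anchor set; a real detector would vectorize the computation rather than loop in Python.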

In object detection, the similarity between two bounding boxes is measured by their Jaccard index, commonly called the intersection over union (IoU). Treating each bounding box as the set of pixels it covers, the IoU is the ratio of the area of their intersection to the area of their union: $$J(\mathcal{A},\mathcal{B}) = \frac{|\mathcal{A} \cap \mathcal{B}|}{|\mathcal{A} \cup \mathcal{B}|}$$. The IoU ranges from 0 to 1, where 0 means the two boxes do not overlap at all and 1 means they coincide exactly.
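The IoU formula can be sketched in a few lines for two axis-aligned boxes. The function name `box_iou` and the $$(x_1, y_1, x_2, y_2)$$ input layout are assumptions for illustration; areas are computed from continuous coordinates rather than by counting pixels.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap; clamp at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(box_iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 (no overlap)
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
```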
