When an object detection model generates numerous anchor boxes, it often outputs multiple predicted bounding boxes that surround the same underlying object, leading to significant overlap and redundancy. To simplify the output so that each object is detected only once, a technique known as non-maximum suppression (NMS) is employed. NMS does not merge boxes. Among similar predicted bounding boxes that belong to the same object, it retains the highest-confidence prediction and discards the other boxes whose overlap with it (measured by Intersection over Union, IoU) exceeds a threshold.
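The suppression step can be sketched in a few lines of plain Python. This is a minimal, class-agnostic illustration and not a reference implementation: the function names `iou` and `nms`, the corner format `(x1, y1, x2, y2)`, and the default threshold of `0.5` are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, in descending order of score.

    Assumption for this sketch: all boxes belong to the same class.
    """
    # Candidate indices sorted from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-confidence box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

In practice NMS is often run separately per predicted class, so that overlapping objects of different classes do not suppress each other.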

Non-Maximum Suppression (NMS)

At inference time, an anchor-box-based object detection model follows a multi-step prediction pipeline for each input image. First, the model generates multiple anchor boxes at various positions and scales across the image. Next, it predicts a class label and a set of bounding-box offsets for every anchor box. The predicted offsets are then applied to adjust the positions and sizes of the corresponding anchor boxes, producing a set of predicted bounding boxes. Finally, a filtering step (for example, a confidence threshold followed by non-maximum suppression) retains only a subset of the predicted bounding boxes, which are output as the final detections.
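The step that turns an anchor box plus predicted offsets into a predicted bounding box can be sketched as follows. The text above does not specify how offsets are parameterized, so this example assumes one common convention: center shifts expressed relative to the anchor's width and height, and log-scale size changes. The function name `apply_offsets` and the tuple layouts are hypothetical, and real systems often add extra scaling constants.

```python
import math


def apply_offsets(anchor, offsets):
    """Decode one predicted box from an anchor and its predicted offsets.

    anchor:  (x1, y1, x2, y2) corner coordinates
    offsets: (dx, dy, dw, dh), assumed parameterization for this sketch:
             dx, dy shift the center in units of anchor width/height,
             dw, dh are log-scale changes of width/height.
    """
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2

    dx, dy, dw, dh = offsets
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)

    # Convert back from center/size form to corner form.
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)


unchanged = apply_offsets((0, 0, 2, 2), (0, 0, 0, 0))   # zero offsets keep the anchor
shifted = apply_offsets((0, 0, 2, 2), (0.5, 0, 0, 0))   # moves right by half a width
```

The decoded boxes would then pass through the filtering step described above.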


In an object detection training dataset, every generated anchor box is treated as an individual training example. To train the model, each anchor box must be assigned two supervision labels: a class label and an offset label. Both are derived from the ground-truth bounding box assigned to the anchor box: the class label is the category of that object, and the offset label describes the position and size of the ground-truth box relative to the anchor box. Together, these labels let the model learn to predict object categories and refine bounding-box coordinates at the same time.

Labeling Anchor Boxes in Training Data

Dive into Deep Learning

Object Detection Prediction Pipeline Using Anchor Boxes

To assign ground-truth bounding boxes to anchor boxes, we use an iterative algorithm based on their Intersection over Union (IoU). Given $$n_a$$ anchor boxes and $$n_b$$ ground-truth bounding boxes (where $$n_a \geq n_b$$), we compute a matrix $$\mathbf{X} \in \mathbb{R}^{n_a \times n_b}$$ where each element $$x_{ij}$$ is the IoU between anchor box $$A_i$$ and ground-truth box $$B_j$$. First, we find the largest element in $$\mathbf{X}$$, assign the corresponding ground-truth box to the anchor box, and discard its entire row and column. We repeat this process of finding the largest remaining element and discarding its row and column until all $$n_b$$ ground-truth boxes are assigned. Finally, we evaluate the remaining $$n_a - n_b$$ unassigned anchor boxes, assigning each to the ground-truth box with the highest IoU in its row, provided that this maximum IoU is strictly greater than a predefined threshold.
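The two-phase procedure above can be sketched in plain Python. This is an illustrative version that operates on a precomputed IoU matrix. The function name `assign_anchors`, the use of `-1` to mark an unassigned anchor box, and the numbers in the example matrix are all assumptions made for this sketch.

```python
def assign_anchors(iou_matrix, threshold=0.5):
    """Map each anchor box (row) to a ground-truth index (column), or -1.

    iou_matrix[i][j] is the IoU between anchor box i and ground-truth box j.
    """
    n_a = len(iou_matrix)
    n_b = len(iou_matrix[0]) if n_a else 0
    assignment = [-1] * n_a  # -1 means "no ground-truth box assigned"
    if n_b == 0:
        return assignment

    free_rows = list(range(n_a))
    free_cols = list(range(n_b))

    # Phase 1: repeatedly take the largest remaining IoU, assign that pair,
    # then discard its row and column.
    while free_rows and free_cols:
        i, j = max(((r, c) for r in free_rows for c in free_cols),
                   key=lambda rc: iou_matrix[rc[0]][rc[1]])
        assignment[i] = j
        free_rows.remove(i)
        free_cols.remove(j)

    # Phase 2: each leftover anchor takes its best ground-truth box,
    # but only if the IoU is strictly greater than the threshold.
    for i in free_rows:
        j = max(range(n_b), key=lambda c: iou_matrix[i][c])
        if iou_matrix[i][j] > threshold:
            assignment[i] = j

    return assignment


# Hypothetical IoU values: 5 anchor boxes (rows) x 2 ground-truth boxes (columns).
X = [
    [0.05, 0.10],
    [0.60, 0.00],
    [0.10, 0.55],
    [0.20, 0.30],
    [0.00, 0.75],
]
result = assign_anchors(X)  # -> [-1, 0, 1, -1, 1]
```

Phase 1 guarantees that every ground-truth box receives at least one anchor box, even when its best IoU is below the threshold. Phase 2 adds further anchors only where the overlap is high enough.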

Ground-Truth Bounding Box Assignment to Anchor Boxes

To illustrate anchor box labeling, consider an image containing ground-truth bounding boxes for a dog and a cat, alongside five generated anchor boxes, denoted $$A_0, \ldots, A_4$$. The labeling process ranks pairs of anchor boxes and ground-truth bounding boxes by Intersection over Union (IoU). Suppose the IoU between anchor box $$A_4$$ and the cat's ground-truth bounding box is the largest among all pairs; then $$A_4$$ is labeled as the cat. After all pairs containing $$A_4$$ or the cat's bounding box are removed, suppose the remaining pair with the largest IoU is $$A_1$$ and the dog's bounding box; then $$A_1$$ is labeled as the dog. Each remaining unlabeled anchor box ($$A_0$$, $$A_2$$, and $$A_3$$) is assigned to the ground-truth bounding box with which it has the highest IoU, but only if that IoU exceeds a predefined threshold (e.g., $$0.5$$). Thus, if the IoU between $$A_2$$ and the cat exceeds the threshold, $$A_2$$ is labeled as the cat; conversely, if the maximum IoUs of $$A_0$$ and $$A_3$$ fall below the threshold, both are labeled as the background class.
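Once each anchor box has been matched to a ground-truth box (or to none), turning that result into class labels is a simple lookup. The sketch below assumes the outcome of the example above is encoded as a list of ground-truth indices, with `-1` meaning "no match". That encoding, the function name `class_labels`, and the string label `"background"` are illustrative choices, not part of the original description.

```python
def class_labels(assignment, gt_classes, background="background"):
    """Convert per-anchor ground-truth indices into class labels.

    assignment: for each anchor box, the index of its assigned ground-truth
                box, or -1 if none was assigned (assumed encoding).
    gt_classes: class name of each ground-truth box.
    """
    return [background if j < 0 else gt_classes[j] for j in assignment]


gt_classes = ["dog", "cat"]
# A0 and A3 unmatched, A1 -> dog, A2 and A4 -> cat, as in the example above.
assignment = [-1, 0, 1, -1, 1]
labels = class_labels(assignment, gt_classes)
```

Here `labels` lists the class for $$A_0$$ through $$A_4$$ in order, with unmatched anchor boxes falling back to the background class.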
