
Object Detection from Scratch: Finding and Labeling Every Object in an Image

1. From Classification to Detection

Image classification looks at a photo and says "cat." Object detection looks at the same photo and says "cat at (120, 45, 280, 190), dog at (310, 60, 480, 220), person at (50, 10, 200, 350)." It doesn't just recognize — it locates. This is the leap from "what" to "what and where," and it's the foundation of every vision system that interacts with the real world.

Why is detection harder than classification? Three reasons. First, the output is variable-length — a scene might contain zero objects or fifty, and the model must handle both. Second, objects appear at wildly different scales — a distant pedestrian might be 20 pixels tall while a nearby car fills the frame. Third, the model must distinguish foreground from background at every location, creating an extreme class imbalance: in a typical image, over 99% of candidate regions contain nothing interesting.

Object detection decomposes into two sub-problems: localization (where is the object?) and classification (what is it?). Localization outputs a bounding box — a rectangle defined by four numbers: (x, y, width, height) or equivalently (x_min, y_min, x_max, y_max). The full detection pipeline outputs a list of tuples: (class, x, y, w, h, confidence). The story of modern object detection is the story of making this pipeline fast, accurate, and end-to-end differentiable. We'll build every piece from scratch, starting with the metric that makes it all work: Intersection over Union.
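The two coordinate conventions are interchangeable, and every detector converts between them constantly. A minimal sketch of the conversion, taking (x, y) as the top-left corner (function names are mine):

```python
import numpy as np

def xywh_to_xyxy(boxes):
    """Convert (x_min, y_min, w, h) to (x_min, y_min, x_max, y_max)."""
    out = boxes.astype(float).copy()
    out[:, 2] = boxes[:, 0] + boxes[:, 2]
    out[:, 3] = boxes[:, 1] + boxes[:, 3]
    return out

def xyxy_to_xywh(boxes):
    """Convert (x_min, y_min, x_max, y_max) to (x_min, y_min, w, h)."""
    out = boxes.astype(float).copy()
    out[:, 2] = boxes[:, 2] - boxes[:, 0]
    out[:, 3] = boxes[:, 3] - boxes[:, 1]
    return out

cat = np.array([[120, 45, 160, 145]])  # (x, y, w, h)
print(xywh_to_xyxy(cat))               # [[120. 45. 280. 190.]] -- the cat box above
```

Corner format is convenient for computing overlaps; center/size format is convenient for regression. Most codebases carry both.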

2. Intersection over Union and Box Regression

Before we can train a detector, we need a way to measure how well a predicted box matches a ground-truth box. The standard metric is Intersection over Union (IoU): the area of overlap divided by the area of union. IoU ranges from 0 (no overlap) to 1 (perfect match), and it's beautifully scale-invariant — two boxes covering a person and two boxes covering a postage stamp are evaluated on the same 0-to-1 scale.

Why not just use L1 or L2 distance between box coordinates? Because coordinate distances are scale-dependent. A 5-pixel error on a 500-pixel box is negligible, but on a 20-pixel box it's catastrophic. IoU captures this naturally. It also penalizes misalignment in all dimensions simultaneously: a box that's the right size but displaced far enough to miss the target entirely gets IoU = 0.

For training, we turn IoU into a loss: L = 1 - IoU. But there's a problem: when two boxes don't overlap at all, IoU = 0 and the gradient vanishes. GIoU (Generalized IoU, Rezatofighi et al. 2019) fixes this by subtracting the fraction of the smallest enclosing rectangle that isn't covered by either box, giving a useful gradient even for non-overlapping predictions. Here's the implementation:

import numpy as np

def compute_iou(boxes_a, boxes_b):
    """IoU between two sets of boxes [x_min, y_min, x_max, y_max]."""
    # Intersection coordinates
    inter_min = np.maximum(boxes_a[:, :2], boxes_b[:, :2])
    inter_max = np.minimum(boxes_a[:, 2:], boxes_b[:, 2:])
    inter_wh = np.clip(inter_max - inter_min, 0, None)
    inter_area = inter_wh[:, 0] * inter_wh[:, 1]

    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union_area = area_a + area_b - inter_area

    return inter_area / np.clip(union_area, 1e-6, None)

def giou_loss(pred, target):
    """Generalized IoU loss for bounding box regression."""
    iou = compute_iou(pred, target)

    # Smallest enclosing box
    enclose_min = np.minimum(pred[:, :2], target[:, :2])
    enclose_max = np.maximum(pred[:, 2:], target[:, 2:])
    enclose_wh = enclose_max - enclose_min
    enclose_area = enclose_wh[:, 0] * enclose_wh[:, 1]

    inter_min = np.maximum(pred[:, :2], target[:, :2])
    inter_max = np.minimum(pred[:, 2:], target[:, 2:])
    inter_wh = np.clip(inter_max - inter_min, 0, None)
    inter_area = inter_wh[:, 0] * inter_wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union_area = area_p + area_t - inter_area

    giou = iou - (enclose_area - union_area) / np.clip(enclose_area, 1e-6, None)
    return 1 - giou  # loss: lower is better

# Example: predicted vs ground-truth box
pred = np.array([[50, 50, 200, 200]])
gt   = np.array([[60, 40, 210, 190]])
print(f"IoU: {compute_iou(pred, gt)[0]:.3f}")       # 0.772
print(f"GIoU loss: {giou_loss(pred, gt)[0]:.3f}")    # 0.236

The compute_iou function works on batches of boxes by vectorizing the intersection and union calculations. GIoU loss adds the enclosing box penalty, ensuring gradients flow even when boxes are far apart. This is the metric that drives every modern detector's training loop.

3. Sliding Windows and Region Proposals

The earliest approach to detection was brute force: slide a fixed-size classifier across every position in the image, at every scale and aspect ratio. For a 640×480 image with 10 scales and 3 aspect ratios, that's roughly 9 million windows to classify. Even with a fast classifier, this is intractable.
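The arithmetic behind that count, assuming one candidate window centered at every pixel:

```python
width, height = 640, 480
scales, ratios = 10, 3
n_windows = width * height * scales * ratios  # one window per pixel position
print(f"{n_windows:,} candidate windows")     # 9,216,000
```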

Selective search (2011) reduced the candidate set to ~2,000 region proposals using hierarchical image segmentation — merging similar adjacent regions into plausible object locations. R-CNN (Girshick et al. 2014) combined selective search with deep learning: extract each proposal, resize it to a fixed size, feed it through a CNN to get features, then classify with an SVM. It worked — but running a CNN independently on 2,000 crops per image was painfully slow (47 seconds per image).

The breakthrough came from sharing computation. Fast R-CNN (2015) runs the CNN once on the entire image, producing a feature map, then projects each proposal's coordinates onto that feature map and pools features from the corresponding region. This is dramatically faster because the expensive convolution happens only once. Faster R-CNN (Ren et al. 2015) took the final step: replacing selective search with a Region Proposal Network (RPN) that shares the backbone CNN, making proposals themselves learned and differentiable. The two-stage paradigm was born: stage 1 proposes candidate regions, stage 2 classifies and refines them. This was state-of-the-art — until someone asked: do we even need two stages?
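The pooling step at the heart of Fast R-CNN can be sketched in a few lines: project the proposal's image coordinates onto the feature map, then max-pool its region into a fixed-size grid. This is a deliberately simplified version (integer bin boundaries, no RoIAlign-style interpolation; names are mine):

```python
import numpy as np

def roi_pool(feature_map, box, stride=16, output_size=2):
    """Max-pool one proposal's feature-map region into a fixed grid.

    feature_map: [H, W, C]; box: (x_min, y_min, x_max, y_max) in image coords.
    """
    # Project image coordinates onto the feature map
    x0, y0, x1, y1 = [int(round(c / stride)) for c in box]
    x1, y1 = max(x1, x0 + 1), max(y1, y0 + 1)   # keep at least one cell
    region = feature_map[y0:y1, x0:x1]

    H, W, C = region.shape
    out = np.zeros((output_size, output_size, C))
    ys = np.linspace(0, H, output_size + 1).astype(int)
    xs = np.linspace(0, W, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            # Max over each spatial bin, guarding against empty slices
            bin_ = region[ys[i]:max(ys[i+1], ys[i]+1),
                          xs[j]:max(xs[j+1], xs[j]+1)]
            out[i, j] = bin_.max(axis=(0, 1))
    return out

feat = np.random.rand(30, 40, 8)              # feature map for a 480x640 image
pooled = roi_pool(feat, (60, 40, 210, 190))   # one proposal -> fixed 2x2x8
print(pooled.shape)                           # (2, 2, 8)
```

Whatever the proposal's size, the output has a fixed shape, so a single fully-connected head can classify every proposal.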

4. Anchor Boxes and Multi-Scale Detection

The anchor is one of detection's most influential ideas. Instead of predicting box coordinates from scratch, define a set of reference boxes (anchors) tiled across the feature map, and predict offsets from each anchor to the true object. Each spatial location gets k anchors — typically 3 scales × 3 aspect ratios = 9 anchors — covering different object shapes.

For each anchor, the network predicts two things: (1) classification — is there an object here or just background? (2) regression — four deltas (dx, dy, dw, dh) that shift and resize the anchor to fit the actual object. During training, we need to decide which anchors are "positive" (responsible for detecting an object) and which are "negative" (background). The standard rule: an anchor is positive if its IoU with any ground-truth box exceeds 0.7, negative if all IoUs are below 0.3, and ignored if in between.
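The standard delta parameterization, used since R-CNN, scales center shifts by the anchor's size and predicts sizes as log-ratios, which keeps regression targets roughly unit-scale regardless of object size. A sketch (function names are mine):

```python
import numpy as np

def encode_deltas(anchors, gt_boxes):
    """Anchor -> target deltas, R-CNN-style. Inputs: [N, 4] xyxy boxes."""
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = anchors[:, 0] + 0.5 * aw
    ay = anchors[:, 1] + 0.5 * ah

    gw = gt_boxes[:, 2] - gt_boxes[:, 0]
    gh = gt_boxes[:, 3] - gt_boxes[:, 1]
    gx = gt_boxes[:, 0] + 0.5 * gw
    gy = gt_boxes[:, 1] + 0.5 * gh

    # Center shifts scaled by anchor size; sizes as log-ratios
    return np.stack([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)], axis=1)

def decode_deltas(anchors, deltas):
    """Invert encode_deltas: apply predicted deltas to anchors."""
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = anchors[:, 0] + 0.5 * aw
    ay = anchors[:, 1] + 0.5 * ah

    cx = ax + deltas[:, 0] * aw
    cy = ay + deltas[:, 1] * ah
    w = aw * np.exp(deltas[:, 2])
    h = ah * np.exp(deltas[:, 3])
    return np.stack([cx - w/2, cy - h/2, cx + w/2, cy + h/2], axis=1)

anchor = np.array([[16., 16., 48., 48.]])
gt     = np.array([[20., 10., 60., 55.]])
deltas = encode_deltas(anchor, gt)
print(np.allclose(decode_deltas(anchor, deltas), gt))  # True: exact round-trip
```

The exponential in decode_deltas also guarantees predicted widths and heights stay positive.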

But objects vary hugely in size. A Feature Pyramid Network (FPN, Lin et al. 2017) addresses this with multi-scale feature maps: the deep, low-resolution layers detect large objects (they have large receptive fields), while shallow, high-resolution layers detect small objects (they preserve fine spatial detail). Top-down connections fuse semantic information from deep layers into the high-resolution maps. Here's how anchor generation and target matching work:

import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Generate anchors for one FPN level."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
    return np.array(anchors)

def match_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Assign each anchor to a ground-truth box or background."""
    n_anchors = len(anchors)
    n_gt = len(gt_boxes)
    labels = -np.ones(n_anchors, dtype=int)  # -1 = ignore
    matched_gt = np.zeros(n_anchors, dtype=int)

    if n_gt == 0:
        labels[:] = 0  # all negative
        return labels, matched_gt

    # IoU matrix: [n_anchors, n_gt]
    ious = np.zeros((n_anchors, n_gt))
    for j in range(n_gt):
        ious[:, j] = compute_iou(anchors, np.tile(gt_boxes[j], (n_anchors, 1)))

    best_gt_per_anchor = ious.argmax(axis=1)
    best_iou_per_anchor = ious.max(axis=1)

    labels[best_iou_per_anchor < neg_thresh] = 0          # negative
    labels[best_iou_per_anchor >= pos_thresh] = 1         # positive
    matched_gt = best_gt_per_anchor

    # Ensure each GT has at least one positive anchor
    best_anchor_per_gt = ious.argmax(axis=0)
    labels[best_anchor_per_gt] = 1
    matched_gt[best_anchor_per_gt] = np.arange(n_gt)

    return labels, matched_gt

# Example: 4x4 feature map, stride 16, one scale, one ratio
anchors = generate_anchors(4, 4, stride=16, scales=[32], ratios=[1.0])
gt_boxes = np.array([[20, 10, 60, 55]])
labels, matched = match_anchors(anchors, gt_boxes)
print(f"Anchors: {len(anchors)}, Positive: {(labels==1).sum()}, Negative: {(labels==0).sum()}")

The generate_anchors function tiles boxes across a grid, and match_anchors assigns each one to a ground-truth object or marks it as background. The "ensure each GT has at least one positive anchor" rule is critical — without it, small objects that don't reach the 0.7 IoU threshold with any anchor would have no training signal at all.

5. Non-Maximum Suppression

A detector with anchors at every spatial location produces thousands of candidate boxes per image. Most of them overlap heavily — a person might trigger positive detections in dozens of nearby anchors. We need to collapse these redundant detections into one box per object. That's the job of Non-Maximum Suppression (NMS).

The algorithm is greedy and elegant: sort all detections by confidence, keep the most confident one, remove every other detection that overlaps with it above a threshold (typically IoU > 0.5), then repeat with the next highest remaining detection. It's O(n²) in the worst case but very fast in practice because most boxes are eliminated early.

Standard NMS has a weakness: the hard IoU threshold. Set it too low and you aggressively remove nearby objects of the same class (two people standing side by side). Set it too high and you get duplicate detections. Soft-NMS (Bodla et al. 2017) replaces the hard cutoff with a Gaussian decay: instead of removing overlapping boxes entirely, it reduces their confidence proportionally to their overlap. This preserves nearby objects while still suppressing true duplicates.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Standard greedy Non-Maximum Suppression."""
    order = scores.argsort()[::-1]
    keep = []

    while len(order) > 0:
        i = order[0]
        keep.append(i)
        if len(order) == 1:
            break
        remaining = order[1:]
        ious = compute_iou(
            np.tile(boxes[i], (len(remaining), 1)),
            boxes[remaining]
        )
        order = remaining[ious < iou_threshold]

    return np.array(keep)

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.01):
    """Soft-NMS: decay scores instead of hard removal."""
    n = len(boxes)
    indices = np.arange(n)
    out_scores = scores.copy()

    for i in range(n):
        max_idx = out_scores[indices[i:]].argmax() + i
        # Swap current position with max
        indices[i], indices[max_idx] = indices[max_idx], indices[i]

        best = indices[i]
        remaining = indices[i+1:]
        if len(remaining) == 0:
            break
        ious = compute_iou(
            np.tile(boxes[best], (len(remaining), 1)),
            boxes[remaining]
        )
        # Gaussian decay
        out_scores[remaining] *= np.exp(-ious**2 / sigma)

    keep = indices[out_scores[indices] > score_threshold]
    return keep

# Example: 5 overlapping boxes around one object
boxes = np.array([[100,100,200,200],[105,98,205,198],
                  [110,102,210,202],[300,300,400,400],[305,298,405,398]])
scores = np.array([0.9, 0.85, 0.7, 0.95, 0.6])
print(f"Before NMS: {len(boxes)} boxes")
print(f"After NMS:  {len(nms(boxes, scores))} boxes")  # 2 boxes (one per object)

NMS is the final step in virtually every detection pipeline. It takes the raw, overlapping output of the network and produces clean, one-box-per-object predictions. Simple, effective, and used in everything from Faster R-CNN to the latest YOLO.

6. YOLO — You Only Look Once

In 2016, Joseph Redmon asked a radical question: what if we skip proposals entirely and predict all boxes in one forward pass? The result was YOLO — You Only Look Once — and it changed detection forever.

The idea is deceptively simple. Divide the image into an S×S grid. Each grid cell is responsible for detecting objects whose center falls within that cell. For each cell, predict B bounding boxes, each with 5 values: (x, y, w, h, confidence), plus C class probabilities shared across all boxes in the cell. The entire output is a fixed-size S × S × (B×5 + C) tensor — one tensor from one forward pass.
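With the original paper's settings (S = 7, B = 2, C = 20 for PASCAL VOC), the arithmetic is striking:

```python
S, B, C = 7, 2, 20                 # original YOLO v1 on PASCAL VOC
n_outputs = S * S * (B * 5 + C)    # one fixed-size tensor per image
print(n_outputs)                   # 1470 numbers describe every object in the scene
```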

The speed gain was staggering: YOLO v1 ran at 45 FPS, over 1,000× faster than R-CNN's 47 seconds per image. The tradeoff was accuracy, especially on small objects (a coarse 7×7 grid can't resolve nearby small items) and crowded scenes (each cell predicts at most B objects). But the core insight — detection as fixed-size regression — proved incredibly powerful.

Modern YOLO variants (v5 through v8) have evolved far beyond the original: anchor-free designs, decoupled classification and regression heads, multi-scale prediction via FPN, and aggressive augmentation like mosaic (stitching 4 training images into one). The YOLO family now dominates real-time detection, powering everything from autonomous vehicles to phone cameras. Here's a simplified YOLO-style detection head:

import numpy as np

class SimpleYOLOHead:
    """Simplified YOLO: grid-based single-pass detection."""

    def __init__(self, grid_size=7, num_classes=3, img_size=448):
        self.S = grid_size
        self.C = num_classes
        self.img_size = img_size
        self.cell_size = img_size / grid_size

    def decode(self, predictions):
        """Decode raw grid predictions into bounding boxes.

        predictions: [S, S, 5 + C] per cell: (tx, ty, tw, th, conf, class_probs)
        tx, ty are offsets within the cell (0-1)
        tw, th are fractions of image size (0-1)
        """
        S, C = self.S, self.C
        boxes, scores, classes = [], [], []

        for row in range(S):
            for col in range(S):
                cell = predictions[row, col]
                tx, ty, tw, th, conf = cell[:5]
                class_probs = cell[5:5 + C]

                # Decode to image coordinates
                cx = (col + tx) * self.cell_size
                cy = (row + ty) * self.cell_size
                w = tw * self.img_size
                h = th * self.img_size

                x_min = cx - w / 2
                y_min = cy - h / 2
                x_max = cx + w / 2
                y_max = cy + h / 2

                cls = class_probs.argmax()
                score = conf * class_probs[cls]

                if score > 0.1:  # confidence threshold
                    boxes.append([x_min, y_min, x_max, y_max])
                    scores.append(score)
                    classes.append(cls)

        if not boxes:
            return np.zeros((0, 4)), np.array([]), np.array([])

        boxes = np.array(boxes)
        scores = np.array(scores)
        keep = nms(boxes, scores, iou_threshold=0.5)
        return boxes[keep], scores[keep], np.array(classes)[keep]

# Simulate: 7x7 grid, 3 classes, one object in cell (3, 2)
preds = np.zeros((7, 7, 8))         # 5 box params + 3 classes
preds[3, 2] = [0.5, 0.5, 0.3, 0.4, 0.92, 0.05, 0.9, 0.05]

detector = SimpleYOLOHead(grid_size=7, num_classes=3)
boxes, scores, classes = detector.decode(preds)
print(f"Detected {len(boxes)} object(s), class={classes}, conf={scores}")

The decode method converts the network's raw grid output into image-coordinate bounding boxes, filters by confidence, and applies NMS. In a real YOLO, the network learns these predictions end-to-end via a multi-part loss: coordinate regression error (weighted heavily), objectness confidence error, and classification error.
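That loss can be sketched as follows, simplified to one box per cell and squared-error classification as in v1, with the paper's weights λ_coord = 5 and λ_noobj = 0.5 (the tensor layout and names are mine):

```python
import numpy as np

def yolo_v1_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified YOLO v1 loss: one box per cell, [S, S, 5 + C] tensors.

    Per-cell layout: (x, y, sqrt_w, sqrt_h, conf, class_probs...);
    v1 regresses square roots of width/height to soften large-box errors.
    obj_mask: [S, S] bool, True where a ground-truth object center falls.
    """
    noobj = ~obj_mask
    # Coordinate regression, only for cells that contain an object
    coord = lambda_coord * ((pred[obj_mask, :4] - target[obj_mask, :4])**2).sum()
    # Objectness: full weight where an object exists...
    conf_obj = ((pred[obj_mask, 4] - target[obj_mask, 4])**2).sum()
    # ...down-weighted elsewhere, since background cells vastly outnumber objects
    conf_noobj = lambda_noobj * (pred[noobj, 4]**2).sum()
    # Classification, only where an object exists
    cls = ((pred[obj_mask, 5:] - target[obj_mask, 5:])**2).sum()
    return coord + conf_obj + conf_noobj + cls

S, C = 7, 3
pred = np.random.rand(S, S, 5 + C)
target = np.zeros((S, S, 5 + C))
obj_mask = np.zeros((S, S), dtype=bool)
obj_mask[3, 2] = True
target[3, 2] = [0.5, 0.5, 0.55, 0.63, 1.0, 0, 1, 0]
print(f"loss = {yolo_v1_loss(pred, target, obj_mask):.3f}")
```

The λ_noobj down-weighting is YOLO's answer to the same foreground/background imbalance that focal loss later solved more elegantly.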

7. Anchor-Free Detection

Anchor boxes work, but they're awkward. How many scales and aspect ratios should you use? The choice is a brittle hyperparameter, and anchors create a massive positive/negative imbalance — for every anchor that matches an object, there might be 1,000 that don't. This imbalance drowns the useful gradient signal in easy-negative noise.

FCOS (Tian et al. 2019) threw away anchors entirely. At every pixel in the feature map, FCOS predicts four distances: (l, t, r, b) from that pixel to the left, top, right, and bottom edges of the enclosing object, plus a centerness score that down-weights predictions far from the object center. No anchors, no anchor tuning — just dense per-pixel prediction.

CenterNet (Zhou et al. 2019) took a different approach: predict object centers as peaks in a heatmap, then regress width and height at each peak. Detection reduces to keypoint estimation — find the hotspots, read off the box dimensions. Elegant and fast.
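A sketch of the CenterNet-style decoding step under simple assumptions (a 3×3 local-maximum test, single class, no offset head; function names are mine):

```python
import numpy as np

def extract_peaks(heatmap, wh_map, stride=4, threshold=0.3):
    """Decode center-heatmap predictions: local maxima become detections."""
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    # 3x3 neighborhood maximum at every location
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    local_max = windows.max(axis=(2, 3))
    peaks = (heatmap == local_max) & (heatmap > threshold)

    boxes, scores = [], []
    for y, x in zip(*np.where(peaks)):
        w, h = wh_map[y, x]                    # regressed size at this peak
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        boxes.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
        scores.append(heatmap[y, x])
    return np.array(boxes), np.array(scores)

heat = np.zeros((32, 32))
heat[10, 12] = 0.9           # one hot spot at feature cell (10, 12)
wh = np.zeros((32, 32, 2))
wh[10, 12] = [40, 60]        # predicted width/height at that peak
boxes, scores = extract_peaks(heat, wh)
print(boxes, scores)         # one 40x60 box centered near (50, 42)
```

Because peaks must be strict local maxima, nearby duplicate detections suppress themselves, so CenterNet needs no NMS at all.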

Both approaches need to handle the foreground/background imbalance without anchors. The answer is focal loss (Lin et al. 2017): a modified cross-entropy that down-weights easy negatives by a factor of (1 - p_t)^γ. With γ = 2, a confident background prediction (p_t = 0.99) gets its loss scaled down by roughly 10,000× relative to plain cross-entropy. This focuses learning on the rare, difficult examples that actually matter.

import numpy as np

def focal_loss(pred_prob, target, alpha=0.25, gamma=2.0):
    """Focal loss: down-weight easy negatives in dense detection."""
    pred_prob = np.clip(pred_prob, 1e-7, 1 - 1e-7)
    # p_t = p for positive, 1-p for negative
    p_t = np.where(target == 1, pred_prob, 1 - pred_prob)
    alpha_t = np.where(target == 1, alpha, 1 - alpha)

    loss = -alpha_t * (1 - p_t)**gamma * np.log(p_t)
    return loss.mean()

def fcos_decode(center_map, distance_map, stride=8, threshold=0.3):
    """Decode FCOS-style per-pixel predictions into boxes."""
    H, W = center_map.shape
    boxes, scores = [], []

    for y in range(H):
        for x in range(W):
            centerness = center_map[y, x]
            if centerness < threshold:
                continue
            l, t, r, b = distance_map[y, x]
            px = (x + 0.5) * stride  # pixel coordinate
            py = (y + 0.5) * stride

            x_min = px - l
            y_min = py - t
            x_max = px + r
            y_max = py + b

            if x_max > x_min and y_max > y_min:
                boxes.append([x_min, y_min, x_max, y_max])
                scores.append(centerness)

    return np.array(boxes) if boxes else np.zeros((0,4)), np.array(scores)

# Focal loss demo: compare standard CE vs focal on an easy negative
easy_neg_prob = 0.01  # model gives 1% foreground prob (correctly ID's background)
ce_loss = -np.log(1 - easy_neg_prob)          # standard CE for this sample
fl_loss = focal_loss(np.array([easy_neg_prob]), np.array([0]))
print(f"Easy negative: CE={ce_loss:.4f}, Focal={fl_loss:.8f}")
# p_t = 0.99, so (1-p_t)^2 = 0.0001 — focal loss ~10,000x smaller

Focal loss is now standard in all single-stage detectors. The reduction factor (1 - p_t)^γ automatically handles the 1:1000+ foreground/background ratio without needing anchor-based sampling tricks. The trend in detection is clear: from complex two-stage pipelines with hand-crafted anchors toward simple, anchor-free, end-to-end approaches.

8. Evaluation — mAP and the COCO Metrics

How do you measure detector quality? A detection is "correct" if two conditions hold: the predicted class matches the ground truth, and the IoU between predicted and ground-truth boxes exceeds a threshold. The classic PASCAL VOC metric uses IoU > 0.5 (called AP@0.5). The stricter COCO metric averages over ten thresholds from 0.5 to 0.95 in steps of 0.05 (called AP@[.5:.95]), rewarding precise localization.

For each class, we compute a precision-recall curve by sweeping the confidence threshold from high to low. Average Precision (AP) is the area under this curve. Mean Average Precision (mAP) averages AP across all classes. COCO also reports size-specific metrics — AP_small (object area < 32² pixels), AP_medium, and AP_large — revealing whether a model handles small objects well or only succeeds on large, easy targets.
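A minimal single-class AP computation at one IoU threshold, using greedy matching of confidence-sorted detections to as-yet-unmatched ground truths (function names are mine; a small pairwise IoU helper is included to keep the sketch self-contained):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two single boxes in (x_min, y_min, x_max, y_max) format."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def average_precision(det_boxes, det_scores, gt_boxes, iou_thresh=0.5):
    """Single-class AP at one IoU threshold (all-point interpolation)."""
    order = det_scores.argsort()[::-1]
    matched = np.zeros(len(gt_boxes), dtype=bool)
    tp = np.zeros(len(order))

    for rank, i in enumerate(order):
        # Best still-unmatched ground truth for this detection
        best_iou, best_j = 0.0, -1
        for j in range(len(gt_boxes)):
            if not matched[j] and box_iou(det_boxes[i], gt_boxes[j]) > best_iou:
                best_iou, best_j = box_iou(det_boxes[i], gt_boxes[j]), j
        if best_iou >= iou_thresh:
            tp[rank] = 1
            matched[best_j] = True

    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / max(len(gt_boxes), 1)

    # Area under the precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

gts = np.array([[100,100,200,200],[300,300,400,400]])
dets = np.array([[102,98,198,202],[305,295,405,395],[50,400,90,440]])
scores = np.array([0.9, 0.8, 0.7])
print(f"AP@0.5 = {average_precision(dets, scores, gts):.3f}")  # 1.000
```

Note how the low-confidence false positive doesn't hurt AP here: it arrives after full recall is already reached, so it adds no area under the curve.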

The speed-accuracy tradeoff defines detector families: YOLO variants dominate the speed end (30-160 FPS) with competitive accuracy, while two-stage detectors and transformer-based approaches like DETR achieve the highest mAP at slower speeds. In practice, the right detector depends on your latency budget: autonomous driving needs 30+ FPS, while medical image analysis can afford minutes per image.


9. References & Further Reading