Chapter 7: Basics of object detection

Overview

Creating a detection dataset

Region proposals

Selective Search Algorithm

Non-max suppression

Mean Average Precision

R-CNN (Region-Based CNN)

Implementation

Additional resources

Overview

Object detection comes in handy when there are multiple classes to detect in a single image, and when the object is much smaller than the image

Two outputs per detected object: a class and a bounding box

Steps to train a detection model

Create a dataset of images, with the position and label of every bounding box

Scan the images to get regions of interest that may contain objects.

Selective search and anchor boxes are 2 common techniques for region proposals

Make a prediction, and compute the Intersection over Union (IoU) with the ground truth to determine the class

Compute the bounding box offset to rectify the region proposals found in the second step.

Create a model using classes and bounding boxes

Measure the accuracy of the detection using mean average precision (mAP)

Intersection over Union (IoU)

IoU = 0 when there is no overlap between the prediction and the ground truth

IoU = 1 when the two boxes overlap perfectly
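
As a quick numeric check of these two extremes, here is a minimal sketch of IoU for axis-aligned boxes in (x1, y1, x2, y2) format. The function name `iou` and the sample boxes are illustrative assumptions, not part of the original notes.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0, no overlap
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0, identical boxes
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # about 0.33, half of each box overlaps
```

In the last case the intersection is 50 and the union is 100 + 100 - 50 = 150, hence 1/3.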

Creating a detection dataset

Install the CVAT annotation tool by cloning the project

Choose your installation method; I recommend using Docker

Entire workflow:

Shell

$ git clone https://github.com/openvinotoolkit/cvat.git
$ cd cvat
$ docker-compose up -d
$ docker exec -it cvat bash -ic 'python3 ~/manage.py createsuperuser'

Next, open your browser and connect to http://localhost:8080/

I already have a project created here; on a fresh install you won’t have any yet

Click “Create a new task”, and specify the labels that you want to detect

Follow the workflow, including uploading images

Start labelling

Click on the left rectangular shape, choose your label and draw the bounding box around the object

For a detection task, we can only draw rectangles, even if the fit seems imperfect

Later on when studying image segmentation, we will fit the mask to the shape we want to detect

You can add several boxes on the same image

Once you have labeled the image, click on the “>” button at the top to display the next one

Click on “Save” from time to time to save your progress

When you have labelled all your data, export the dataset by clicking on “Menu” > “Export task”

Choose the format of the dataset depending on the model

Pascal VOC is a frequent choice when working with NVIDIA models, for example

The dataset folder contains:

labelmap.txt

Plain Text

# label:color_rgb:parts:actions
background:0,0,0::
tape:128,0,0::

Here I only have one class to detect for the entire dataset, “tape”

“Background” is always added; it is the class used when no object is present in the image
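
If you need the label list in Python, the file is simple enough to parse by hand. This is a minimal sketch, assuming the `label:color_rgb:parts:actions` layout shown above; the helper name `parse_labelmap` is hypothetical.

```python
LABELMAP = """# label:color_rgb:parts:actions
background:0,0,0::
tape:128,0,0::
"""

def parse_labelmap(text):
    """Map each label to its (r, g, b) color, skipping comments and blank lines."""
    colors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        label, rgb = line.split(":")[:2]
        colors[label] = tuple(int(c) for c in rgb.split(","))
    return colors

print(parse_labelmap(LABELMAP))  # {'background': (0, 0, 0), 'tape': (128, 0, 0)}
```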

One XML file is generated for each image

XML

<annotation>
  <folder>IMG</folder>
  <filename>IMG_20211210_154131.jpg</filename>
  <source>
    <database>Unknown</database>
    <annotation>Unknown</annotation>
    <image>Unknown</image>
  </source>
  <size>
    <width>4608</width>
    <height>3456</height>
    <depth></depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>tape</name>
    <truncated>0</truncated>
    <occluded>0</occluded>
    <difficult>0</difficult>
    <bndbox>
      <xmin>1240.67</xmin>
      <ymin>1368.33</ymin>
      <xmax>4301.7</xmax>
      <ymax>1809.15</ymax>
    </bndbox>
  </object>
</annotation>
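
To consume these annotations in Python, the standard library's `xml.etree.ElementTree` is enough. Below is a minimal sketch, assuming the Pascal VOC layout shown above. The helper name `parse_voc` is hypothetical, and the sample string is a trimmed copy of the file above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<annotation>
  <filename>IMG_20211210_154131.jpg</filename>
  <size><width>4608</width><height>3456</height><depth></depth></size>
  <object>
    <name>tape</name>
    <bndbox>
      <xmin>1240.67</xmin><ymin>1368.33</ymin>
      <xmax>4301.7</xmax><ymax>1809.15</ymax>
    </bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text.strip())
    filename = root.findtext("filename")
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((obj.findtext("name"), coords))
    return filename, objects

filename, objects = parse_voc(SAMPLE)
print(filename, objects)
```

Coordinates are kept as floats because CVAT exports sub-pixel values; round them if your model expects integers.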

Google Colaboratory

https://colab.research.google.com/drive/1dJMqqEkOW9sV9Xj3ds-NISmPRqcXocfN

Region proposals

Goal: identify pixels with similar values and group them into clusters

Selective search over-segments the image into small patches based on pixel intensity, using a graph-based segmentation method

Over-segmentation

Loop:

Add the bounding box of each segment to the region proposals

Group adjacent segments based on similarity

Selective Search Algorithm

Off-the-shelf Felzenszwalb segmentation

Python

from skimage.segmentation import felzenszwalb

# img is an image loaded as a NumPy array
segments_fz = felzenszwalb(img, scale=200)

Pixels with similar values are clustered together into a region proposal

The neural network only needs to determine whether a region is an object or background

If it is an object, the region helps us compute the bounding box offset and determine its class

Complete selective search (using Felzenszwalb internally)

Python

!pip install -q selectivesearch
import numpy as np
import selectivesearch

def extract_candidates(img):
  # Each region has a "rect" (x, y, w, h) and a "size" in pixels
  img_lbl, regions = selectivesearch.selective_search(img, scale=200, min_size=100)
  img_area = np.prod(img.shape[:2])
  candidates = []
  for r in regions:
    # Skip duplicates, regions below 5% of the image, and regions covering the whole image
    if (r["rect"] not in candidates) \
        and (r["size"] > img_area * 0.05) \
        and (r["size"] < img_area):
      candidates.append(r["rect"])
  return candidates

Non-max suppression

In the tiger image above, many boxes are overlapping

If we obtained these boxes from a detection model and each box is mapped to a score, non-max suppression (NMS) orders the boxes by confidence score.

It then takes the most confident box as the reference and removes every lower-confidence box whose IoU with that reference exceeds a threshold

Otherwise, if run after segmentation, as above, we don’t have scores. So we just order the boxes along an arbitrary axis (such as y2) before removing boxes based on IoU in the same way.

Implementation

Python

import numpy as np

def NMS(boxes, overlapThresh=0.3):
    # boxes: NumPy array of shape (N, 4) holding (x1, y1, x2, y2) rows
    if len(boxes) == 0:
        return []

    x1 = boxes[:, 0] # x coordinate of the top-left corner
    y1 = boxes[:, 1] # y coordinate of the top-left corner
    x2 = boxes[:, 2] # x coordinate of the bottom-right corner
    y2 = boxes[:, 3] # y coordinate of the bottom-right corner

    areas = (x2 - x1 + 1) * (y2 - y1 + 1) # We add 1 because the pixels at both the start and the end count

    indices = np.arange(len(x1))
    for idx, box in enumerate(boxes):
        # Compare the current box against every other box that is still kept
        temp_indices = indices[indices != idx]

        xx1 = np.maximum(box[0], boxes[temp_indices, 0])
        yy1 = np.maximum(box[1], boxes[temp_indices, 1])
        xx2 = np.minimum(box[2], boxes[temp_indices, 2])
        yy2 = np.minimum(box[3], boxes[temp_indices, 3])

        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)

        # Fraction of each other box that is covered by the current box
        overlap = (w * h) / areas[temp_indices]
        # If the current bounding box overlaps any other box by more than the threshold, remove its index
        if np.any(overlap > overlapThresh):
            indices = indices[indices != idx]
    return boxes[indices].astype(int)

In our case, it returns a single box almost as wide as the image itself. That’s because the tiger occupies the entire image.
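
The implementation above covers the score-free case. For the score-based variant described at the start of this section, a minimal pure-Python sketch could look like the following. The names `box_iou` and `nms_with_scores` are hypothetical, and the 0.5 threshold is only an example value.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms_with_scores(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # the highest remaining score becomes the reference box
        keep.append(best)
        # Drop every remaining box that overlaps the reference too much
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms_with_scores(boxes, scores))  # [0, 2]
```

The second box is suppressed because its IoU with the first one is about 0.92, while the third box does not overlap anything and survives. The inference code later in this chapter relies on torchvision's `ops.nms` for the same job on tensors.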

Mean Average Precision

Precision is defined as $\text{Precision} = \frac{TP}{TP + FP}$

True positives (TP) are bounding boxes with the correct class and an IoU with the ground truth above some threshold

False positives (FP) are bounding boxes with the wrong class or an IoU below that threshold

If multiple boxes are identified for the same object, only one can be a TP; the rest are FPs

2 metrics

Average precision (AP): the mean of the precision values obtained at various IoU thresholds

mAP: the mean of the AP values across all classes of the dataset
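
To make the TP/FP bookkeeping concrete, here is a small sketch that computes precision for one class on one image at a single IoU threshold. The function names and sample boxes are made up for illustration. Detections are processed by decreasing confidence, and each ground truth box can be matched only once.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def precision_at_iou(detections, gt_boxes, iou_threshold=0.5):
    """detections: list of (confidence, box); gt_boxes: list of boxes."""
    matched = set()  # indices of ground truth boxes already claimed by a detection
    tp = fp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        ious = [box_iou(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda i: ious[i], default=None)
        if best is not None and ious[best] >= iou_threshold and best not in matched:
            matched.add(best)
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if tp + fp else 0.0

gt = [(0, 0, 100, 100)]
dets = [
    (0.9, (5, 5, 100, 100)),      # good match: TP
    (0.8, (0, 0, 95, 95)),        # same object detected again: FP
    (0.3, (200, 200, 250, 250)),  # no overlap with any object: FP
]
print(precision_at_iou(dets, gt))  # about 0.33 (1 TP out of 3 detections)
```

Under the definitions used in these notes, averaging this value over several IoU thresholds gives the AP of the class, and averaging the AP over all classes gives the mAP.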

R-CNN (Region-Based CNN)

Google Colaboratory

https://colab.research.google.com/drive/1SfASCWdEiRK9CxAqwhGZrHU1iI000eBy#scrollTo=GB6J5fGFOCW8

Steps

Generate region proposals from the image (with many redundant boxes, since we need to avoid false negatives)

Warp each region into an image of fixed size

Forward the region proposals through a pre-trained network (ResNet50, VGG16) and extract features from a fully connected layer

Create data where the inputs are the extracted features and the outputs are the class corresponding to each region proposal and the bbox offset from the ground truth

Two outputs: bbox regressor and label classifier

Define loss to minimize object classification error and bbox offset error, then back-propagate it

Implementation

We begin by downloading a Kaggle subset of the Google Open Images project that includes only trucks and buses

Python

from google.colab import files

files.upload()  # select your kaggle.json when prompted
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d sixhky/open-images-bus-trucks/
!unzip -qq open-images-bus-trucks.zip

Again, you’ll need to upload your kaggle.json API key; you can find it on your Kaggle profile

We load the dataset. Each line is an object, i.e. a box with a label. One image can have several objects

Python

import pandas as pd

df = pd.read_csv("df.csv")

We then need to define a first Dataset class to extract the image, boxes and labels from the dataframe

We index the dataset by unique image name

Then we filter the dataframe for each image name: each row is an object of the given image

We return the image and its associated lists of labels and boxes

Python

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset

class ImageDataset(Dataset):
  def __init__(self, path, df):
    self.path = path
    self.df = df
    self.images = df.ImageID.unique()

  def __getitem__(self, idx):
    image_id = self.images[idx]
    filename = os.path.join(self.path, image_id+".jpg")
    img = cv2.imread(filename)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]
    # Every row of df_image is one object (box + label) of this image
    df_image = self.df.loc[self.df.ImageID == image_id]
    cols = df_image.columns
    col2idx = dict(zip(cols, range(len(cols))))
    bboxes, labels = [], []
    for row in df_image.values:
      label = row[col2idx["LabelName"]]
      # Coordinates are stored as fractions of the image size; convert them to pixels
      bbox = [
        int(row[col2idx["XMin"]] * w),
        int(row[col2idx["YMin"]] * h),
        int(row[col2idx["XMax"]] * w),
        int(row[col2idx["YMax"]] * h),
      ]
      labels.append(label)
      bboxes.append(bbox)
    return img, np.array(labels), np.array(bboxes), filename

  def __len__(self):
    return len(self.images)

  def load_img(self, idx):
    img, labels, bboxes, _ = self[idx]
    for label, (x1, y1, x2, y2) in zip(labels, bboxes):
      x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
      img = cv2.rectangle(
          img, (x1, y1), (x2, y2), (0, 255, 0), thickness=2
      )
      img = cv2.putText(
          img, label, (x1, y1), cv2.FONT_HERSHEY_SIMPLEX, .5, (255, 255, 255), 1
      )
    plt.imshow(img)

We define 2 key functions for preprocessing:

extract_regions leverages selective search to fetch region proposals of a given image

compute_iou returns the intersection over union (IoU score) of 2 bboxes

Python

def extract_regions(img):
  _, regions = selectivesearch.selective_search(img, scale=200, min_size=100)
  img_area = np.prod(img.shape[:2])
  # Rects are stored as tuples in a set so that the duplicate check works
  seen, candidates = set(), []
  for region in regions:
    rect = tuple(region["rect"])
    if (
      rect not in seen and
      region["size"] >= (img_area * 0.05) and
      region["size"] <= img_area
    ):
      x, y, w, h = rect
      seen.add(rect)
      # Convert (x, y, w, h) to (x1, y1, x2, y2)
      candidates.append([x, y, x+w, y+h])
  return np.array(candidates)

def compute_iou(bbox1, bbox2):
  """
  bbox: (x1, y1, x2, y2), x1 < x2 and y1 < y2
  """
  eps = 1e-5  # avoids a division by zero for degenerate boxes
  max_x1 = max(bbox1[0], bbox2[0])
  max_y1 = max(bbox1[1], bbox2[1])
  min_x2 = min(bbox1[2], bbox2[2])
  min_y2 = min(bbox1[3], bbox2[3])
  width = (min_x2 - max_x1)
  height = (min_y2 - max_y1)
  if width < 0 or height < 0:
    return 0.0
  area_within = height * width
  area_bbox1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
  area_bbox2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
  area_total = area_bbox1 + area_bbox2 - area_within
  return area_within / (area_total + eps)

It’s now time to create region proposals and compute deltas from the ground truth boxes

For a single image, we compute the IoU between each object and each region proposal box (aka “candidate”)

Each candidate is associated with the ground truth box that has the highest IoU. If that IoU is not above 0.3, the candidate is considered background

The deltas are the coordinate differences between that ground truth box and the candidate box

We normalize region proposals and deltas by the image dimensions (width for x coordinates, height for y coordinates)

To lighten the process we only use 500 images

Python

from tqdm import tqdm

# ds_img is an ImageDataset instance created beforehand
preprocess = {
  "paths": [],
  "classes": [],
  "gtbbs": [],
  "rois": [],
  "deltas": [],
  "ious": [],
}
N = 500
for idx, (img, labels, gtbbs, filename) in tqdm(enumerate(ds_img), total=N):
  if idx == N:
    break
  H, W = img.shape[:2]
  candidates = extract_regions(img)
  candidates_ious = []
  for bbox in gtbbs:
    candidates_ious.append(
      [compute_iou(bbox, candidate) for candidate in candidates]
    )
  candidates_ious = np.array(candidates_ious).T # row_idx: candidate, col_idx: bbox
  classes, rois, deltas, ious = [], [], [], []
  for jdx, candidate in enumerate(candidates):
    candidate_ious = candidates_ious[jdx]
    best_ious_idx = np.argmax(candidate_ious) # bbox_idx
    best_ious = candidate_ious[best_ious_idx]
    if best_ious > 0.3:
      candidate_clss = labels[best_ious_idx]
    else:
      candidate_clss = "background"
    bx, by, bX, bY = gtbbs[best_ious_idx]
    cx, cy, cX, cY = candidate
    # Offset from the candidate corners to the ground truth corners, in pixels
    delta = np.array([
      (bx - cx),
      (by - cy),
      (bX - cX),
      (bY - cY)
    ])
    norm = np.array([W, H, W, H])
    classes.append(candidate_clss)
    rois.append(candidate/norm)
    deltas.append(delta/norm)
    ious.append(best_ious)
  preprocess["paths"].append(filename)
  preprocess["gtbbs"].append(gtbbs)
  preprocess["rois"].append(rois)
  preprocess["classes"].append(classes)
  preprocess["deltas"].append(deltas)
  preprocess["ious"].append(ious)
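
To see what the regressor is asked to learn, the sketch below encodes one candidate against a ground truth box the same way as the loop above (corner differences divided by [W, H, W, H]) and then decodes it back. The image size and boxes are made-up values, and `encode_delta` / `decode_delta` are hypothetical helper names.

```python
def encode_delta(gt_box, candidate, width, height):
    """Normalized offset from the candidate corners to the ground truth corners."""
    norm = (width, height, width, height)
    return [(g - c) / n for g, c, n in zip(gt_box, candidate, norm)]

def decode_delta(delta, candidate, width, height):
    """Inverse of encode_delta: recover the box in pixels from a normalized delta."""
    norm = (width, height, width, height)
    return [c + d * n for c, d, n in zip(candidate, delta, norm)]

W, H = 640, 480
gt_box = (100, 120, 300, 360)
candidate = (90, 130, 320, 350)

delta = encode_delta(gt_box, candidate, W, H)
restored = decode_delta(delta, candidate, W, H)
print(delta)     # four small values in [-1, 1]
print(restored)  # matches gt_box up to floating point error
```

Decoding multiplies the delta back by the image size before adding it to the candidate, because the candidate is expressed in pixels while the delta is normalized.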

Simple label encoding

Python

label2idx = {'Bus': 0, 'Truck': 2, 'background': 1}
idx2label = {0: 'Bus', 1: 'background', 2: 'Truck'}

We then define the second Dataset class that will fetch the data directly used by our model

__getitem__ loads the image a second time, converts its colors from BGR to RGB (both [..., ::-1] and cvtColor(img, cv2.COLOR_BGR2RGB) have the same effect) and generates a list of crops by using the rois from selective search

collate_fn resizes all crops to the same size, scales pixel values by 1/255, permutes channels (224, 224, 3) → (3, 224, 224), and normalizes the pixel values with the ImageNet mean and std expected by the pre-trained VGG16.

Python

import torch
from torchvision import transforms
from torch.utils.data import Dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet statistics expected by the pre-trained backbone
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

def preprocess_img(img):
    # (H, W, 3) array scaled to [0, 1] -> normalized (3, H, W) tensor
    img = torch.tensor(img).permute(2, 0, 1)
    img = normalize(img)
    return img.to(device).float()

class RCNNDataset(Dataset):
  def __init__(self, paths, gtbbs, rois, classes, deltas, ious):
    self.paths = paths
    self.gtbbs = gtbbs
    self.rois = rois
    self.classes = classes
    self.deltas = deltas
    self.ious = ious

  def __len__(self):
    return len(self.paths)

  def __getitem__(self, idx):
    fpath = self.paths[idx]
    img = cv2.imread(fpath, 1)[...,::-1]
    #img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    H, W = img.shape[:2]
    rois = np.array(self.rois[idx])
    sh = np.array([W, H, W, H])
    # rois are normalized; scale them back to pixel coordinates
    bboxes = (rois * sh).astype(np.uint16)
    crops = [img[y1:y2, x1:x2] for (x1, y1, x2, y2) in bboxes]
    labels = np.array(self.classes[idx])
    deltas = np.array(self.deltas[idx])
    #gtbbs = self.gtbbs[idx]
    return crops, labels, deltas

  def collate_fn(self, batch):
    inputs, labels, deltas = [], [], []
    for crops, img_labels, img_deltas in batch:
      crops = [cv2.resize(crop, (224, 224)) for crop in crops]
      crops = [preprocess_img(crop / 255.)[None] for crop in crops]
      inputs.extend(crops)
      labels.extend([label2idx[label] for label in img_labels])
      deltas.extend(img_deltas)
    inputs = torch.cat(inputs).to(device)
    labels = torch.tensor(labels).long().to(device)
    deltas = torch.tensor(np.array(deltas)).float().to(device)
    return inputs, labels, deltas

We can now create our dataloaders

We manually split train (90%) and test (10%) sets

We then instantiate our datasets and dataloaders with a batch size of only 2 since there are close to 40 crops per image

Python

from torch.utils.data import DataLoader

def get_data(preprocess):
  # The first 90% of the N preprocessed images are used for training, the rest for validation
  val_idx = int(0.9 * N)
  path_train, path_val = preprocess["paths"][:val_idx], preprocess["paths"][val_idx:]
  gtbbs_train, gtbbs_val = preprocess["gtbbs"][:val_idx], preprocess["gtbbs"][val_idx:]
  rois_train, rois_val = preprocess["rois"][:val_idx], preprocess["rois"][val_idx:]
  classes_train, classes_val = preprocess["classes"][:val_idx], preprocess["classes"][val_idx:]
  deltas_train, deltas_val = preprocess["deltas"][:val_idx], preprocess["deltas"][val_idx:]
  ious_train, ious_val = preprocess["ious"][:val_idx], preprocess["ious"][val_idx:]

  ds_train = RCNNDataset(
    path_train, gtbbs_train, rois_train, classes_train, deltas_train, ious_train
  )
  ds_val = RCNNDataset(
    path_val, gtbbs_val, rois_val, classes_val, deltas_val, ious_val
  )
  print(len(ds_train), len(ds_val))

  dl_train = DataLoader(
    ds_train, batch_size=2, collate_fn=ds_train.collate_fn, drop_last=True
  )
  dl_val = DataLoader(
    ds_val, batch_size=2, collate_fn=ds_val.collate_fn, drop_last=True
  )
  return ds_train, ds_val, dl_train, dl_val

Define the model class

get_vgg_backbone downloads the pretrained VGG16 checkpoint and loads it into its architecture.

We overwrite the classifier with an empty Sequential module and freeze the weights

RCNN adds 2 outputs to the backbone, one regressor for bboxes and one classifier for labels

The total loss is computed as cls_loss + lambda * reg_loss with lambda = 10

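Why 10? Presumably because the L1 loss on deltas normalized by the image size is numerically much smaller than the cross-entropy loss, so it needs a larger weight to matter during training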

Python

import torch
from torch import nn
from torchvision import models

def get_vgg_backbone():
  vgg_backbone = models.vgg16(pretrained=True)
  # Input size of the original classifier, i.e. the size of the flattened feature vector
  in_features = list(vgg_backbone.classifier.children())[0].in_features
  vgg_backbone.classifier = nn.Sequential()
  for param in vgg_backbone.parameters():
    param.requires_grad = False
  return vgg_backbone, in_features


class RCNN(nn.Module):
  def __init__(self):
    super().__init__()
    vgg_backbone, in_features = get_vgg_backbone()
    self.backbone = vgg_backbone
    # Regression head: 4 normalized offsets, squashed to [-1, 1] by Tanh
    self.bbox = nn.Sequential(
      nn.Linear(in_features, 512),
      nn.ReLU(),
      nn.Linear(512, 4),
      nn.Tanh()
    )
    # Classification head: Bus, background, Truck
    self.cls_score = nn.Sequential(nn.Linear(in_features, 3))
    self.loss_bbox = nn.L1Loss()
    self.loss_cls = nn.CrossEntropyLoss()

  def forward(self, x):
    x = self.backbone(x)
    bbox = self.bbox(x)
    cls = self.cls_score(x)
    return bbox, cls

  def calc_loss(self, deltas_hat, deltas, labels_hat, labels):
    lambda_reg = 10
    loss_labels = self.loss_cls(labels_hat, labels)
    # The regression loss only applies to non-background crops (background has index 1)
    idxs, = torch.where(labels != 1)
    if len(idxs) > 0:
      loss_bbox = self.loss_bbox(deltas_hat[idxs], deltas[idxs])
      loss_total = loss_labels + lambda_reg * loss_bbox
      return loss_total, loss_bbox.item(), loss_labels.item()
    else:
      loss_bbox = 0
      loss_total = loss_labels
      return loss_total, loss_bbox, loss_labels.item()

We then train the model and backpropagate the total loss, as seen in previous chapters

Inference time! Let’s try to detect objects within a test image, by reusing components from the training

Python

from torchvision import ops

def get_inputs(img):
  candidates = extract_regions(img)
  inputs = []
  for (x1, y1, x2, y2) in candidates:
    crop = img[y1:y2, x1:x2]
    crop = cv2.resize(crop, (224, 224))
    inputs.append(preprocess_img(crop/255.)[None])
  return torch.cat(inputs).to(device), candidates


@torch.no_grad()
def predict(model, inputs):
  model.eval()
  deltas_hat, probs = model(inputs)
  probs = nn.functional.softmax(probs, -1)
  confs, labels_hat = torch.max(probs, -1)
  return [
    t.detach().float().cpu().numpy() for t in [deltas_hat, confs, labels_hat]
  ]


def generate_bboxes(deltas_hat, confs, labels_hat, candidates, img_shape):
  # Keep only the candidates that were not classified as background (index 1)
  idxs, = np.where(labels_hat != 1)
  deltas_hat_ = deltas_hat[idxs]
  labels_hat_ = labels_hat[idxs]
  candidates_ = candidates[idxs]
  confs_ = confs[idxs]
  # Deltas were normalized by the image size during preprocessing, so scale them back to pixels
  H, W = img_shape[:2]
  bboxes_hat_ = candidates_ + deltas_hat_ * np.array([W, H, W, H])
  bboxes_hat_ = np.clip(bboxes_hat_, 0, None).astype(np.uint16)
  # Convert the kept indices to NumPy so that indexing always returns arrays
  keep = ops.nms(
    torch.tensor(bboxes_hat_.astype(np.float32)),
    torch.tensor(confs_),
    iou_threshold=0.05
  ).numpy()
  return bboxes_hat_[keep], labels_hat_[keep], confs_[keep]


def inference(filename, model):
  img = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)
  inputs, candidates = get_inputs(img)
  deltas_hat, confs_hat, labels_hat = predict(model, inputs)
  bboxes, labels, confs = generate_bboxes(
    deltas_hat, confs_hat, labels_hat, candidates, img.shape
  )
  img_copy = img.copy()  # draw on a copy to keep the original image intact
  for (x1, y1, x2, y2), label, conf in zip(bboxes, labels, confs):
    cv2.rectangle(img_copy, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
  plt.imshow(img_copy)
  plt.title("Inference")
Additional resources

Selective search

https://learnopencv.com/selective-search-for-object-detection-cpp-python/

Selective search, CS231b slides

http://vision.stanford.edu/teaching/cs231b_spring1415/slides/ssearch_schuyler.pdf

Non max suppression

https://pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/

R-CNN paper

https://arxiv.org/pdf/1311.2524.pdf