Chapter 7: Basics of object detection

Overview

Object detection comes in handy when there are multiple objects to detect in a single image, and when the objects are much smaller than the image
Two outputs: classes and bounding boxes
Steps to train a detection model
Create a dataset of images, with all bounding box positions and labels
Scan images to get regions of interest that may contain objects
Selective search and anchor boxes are two common techniques for region proposals
Make a prediction and compute the Intersection over Union (IoU) with the ground truth
Compute the bounding box offset to rectify the bounding boxes found in step 2
Create a model that predicts both the classes and the bounding boxes
Measure accuracy of the detection using mean average precision (mAP)
Intersection over Union (IoU)
IoU = 0 when there is no overlap between the prediction and the ground truth
IoU = 1 when the two boxes overlap perfectly
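As a reminder of what is being measured between a predicted box and a ground-truth box:
IoU=\frac{\text{Area of Intersection}}{\text{Area of Union}}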

Creating a detection dataset

Install the cvat annotation tool by cloning the project:
Choose your installation method; I recommend using Docker
Entire workflow
Shell
$ git clone https://github.com/openvinotoolkit/cvat.git
$ cd cvat
$ docker-compose up -d
$ docker exec -it cvat bash -ic 'python3 ~/manage.py createsuperuser'
Next, open your browser and connect to http://localhost:8080/
Here I already have a project created, but you won't
Click "Create a new task", and specify the labels that you want to detect
Follow the workflow, including uploading images
Start labelling
Click on the left rectangular shape, choose your label and draw the bounding box around the object
For a detection task, we can only draw rectangles, even if the fit seems imperfect
Later on, when studying image segmentation, we will fit a mask to the shape we want to detect
You can add several boxes on the same image
Once you have labeled the image, click on the ">" button at the top to display the next one
Click on "Save" from time to time to save your progress
When you have labelled all your data, export the dataset by clicking on "Menu" > "Export task"
Choose the format of the dataset depending on the model
Pascal VOC is a frequent choice when working with NVIDIA models, for example
The dataset folder contains:
labelmap.txt
Plain Text
# label:color_rgb:parts:actions
background:0,0,0::
tape:128,0,0::
Here I only have one class to detect for the entire dataset, "tape"
"Background" is always added; it is the class used when no object is present in the image
One XML file is generated for each image
XML
<annotation>
  <folder>IMG</folder>
  <filename>IMG_20211210_154131.jpg</filename>
  <source>
    <database>Unknown</database>
    <annotation>Unknown</annotation>
    <image>Unknown</image>
  </source>
  <size>
    <width>4608</width>
    <height>3456</height>
    <depth></depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>tape</name>
    <truncated>0</truncated>
    <occluded>0</occluded>
    <difficult>0</difficult>
    <bndbox>
      <xmin>1240.67</xmin>
      <ymin>1368.33</ymin>
      <xmax>4301.7</xmax>
      <ymax>1809.15</ymax>
    </bndbox>
  </object>
</annotation>
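To give an idea of how these files are consumed later, here is a minimal sketch that reads such an annotation with Python's standard library (the function name and example filename are just illustrations):
Python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    """Return the image filename and a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    objects = []
    for obj in root.findall("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        box = tuple(float(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, box))
    return filename, objects

# e.g. parse_voc_annotation("IMG_20211210_154131.xml")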

Region proposals

Goal: identify pixels with similar values and group them into clusters
Selective search first over-segments the image into small patches based on pixel intensity, using a graph-based segmentation method
Over-segmentation
Then, in a loop:
Add the bounding box of each segment to the region proposals
Group adjacent segments based on similarity

Selective Search Algorithm

Off-the-shelf Felzenszwalb segmentation from scikit-image
Python
from skimage.segmentation import felzenszwalb

segments_fz = felzenszwalb(img, scale=200)
Pixels with similar values are clustered together into region proposals
The neural network then only needs to determine whether a region is an object or background
If it is an object, the region is used to compute the bounding box offset and determine its class
Complete selective search (which uses Felzenszwalb internally)
Python
!pip install -q selectivesearch
import selectivesearch

def extract_candidates(img):
    img_lbl, regions = selectivesearch.selective_search(img, scale=200, min_size=100)
    img_area = np.prod(img.shape[:2])
    candidates = []
    for r in regions:
        if (not r["rect"] in candidates) \
                and (r["size"] > img_area * 0.05) \
                and (r["size"] < img_area):
            candidates.append(r["rect"])
    return candidates
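A short usage sketch that produces the kind of overlapping boxes discussed in the next section; the filename is a placeholder, and cv2 / matplotlib are assumed to be imported as in the rest of the notebook:
Python
img = cv2.cvtColor(cv2.imread("tiger.jpg"), cv2.COLOR_BGR2RGB)  # placeholder filename

# Draw every region proposal; many of them will overlap heavily
candidates = extract_candidates(img)
img_show = img.copy()
for (x, y, w, h) in candidates:
    cv2.rectangle(img_show, (x, y), (x + w, y + h), (0, 255, 0), 2)
plt.imshow(img_show)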

Non-max suppression

In the tiger image above, many boxes are overlapping
If we obtained these boxes from a detection model, each box is mapped to a confidence score and non-max suppression (NMS) starts by ordering the boxes by that score.
Then it removes every lower-confidence box whose IoU with the reference box is above a threshold.
Otherwise, when NMS is run after segmentation (like here) we have no scores, so we simply order the boxes along an arbitrary axis (like y2) before removing boxes using IoU.
Implementation
Python
def NMS(boxes, overlapThresh=0.3):
    if len(boxes) == 0:
        return []
    x1 = boxes[:, 0]  # x coordinate of the top-left corner
    y1 = boxes[:, 1]  # y coordinate of the top-left corner
    x2 = boxes[:, 2]  # x coordinate of the bottom-right corner
    y2 = boxes[:, 3]  # y coordinate of the bottom-right corner
    # We add 1 because the pixels at the start and at the end both count
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    indices = np.arange(len(x1))
    for idx, box in enumerate(boxes):
        temp_indices = indices[indices != idx]
        xx1 = np.maximum(box[0], boxes[temp_indices, 0])
        yy1 = np.maximum(box[1], boxes[temp_indices, 1])
        xx2 = np.minimum(box[2], boxes[temp_indices, 2])
        yy2 = np.minimum(box[3], boxes[temp_indices, 3])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        overlap = (w * h) / areas[temp_indices]
        # If the current bounding box overlaps more than the threshold with any other box, drop its index
        if np.any(overlap > overlapThresh):
            indices = indices[indices != idx]
    return boxes[indices].astype(int)
In our case, it returns a single box almost as wide as the image itself, because the tiger occupies the entire image
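A minimal sketch of that call, reusing img and extract_candidates from the selective search example above; since selective search gives no scores, we just convert the (x, y, w, h) rectangles to corner format and sort by y2:
Python
boxes = np.array([[x, y, x + w, y + h] for (x, y, w, h) in extract_candidates(img)])
boxes = boxes[np.argsort(boxes[:, 3])]  # order along y2, since there are no confidence scores
kept = NMS(boxes, overlapThresh=0.3)
print(f"{len(boxes)} candidates -> {len(kept)} box(es) after NMS")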

Mean Average Precision

Precision is defined as Precision=\frac{TP}{TP+FP}
True positives are bounding boxes with the correct class and an IoU with the ground truth above some threshold
False positives are bounding boxes with the wrong class or an IoU below the threshold
If multiple boxes are identified for the same object, only one can be a TP; the rest are FPs
Two metrics
Average precision (AP): precision averaged over various IoU thresholds, for a single class
mAP: the mean of the per-class average precisions across all classes of the dataset
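A toy sketch matching this simplified definition (the standard VOC/COCO protocols integrate a full precision-recall curve instead), assuming a compute_iou(bbox1, bbox2) helper like the one defined in the R-CNN implementation below, and dicts mapping each class name to its predicted and ground-truth boxes:
Python
def precision_at(pred_boxes, gt_boxes, iou_thresh):
    # Fraction of predicted boxes that match at least one ground-truth box at this IoU
    if len(pred_boxes) == 0:
        return 0.0
    tp = sum(
        any(compute_iou(pred, gt) >= iou_thresh for gt in gt_boxes)
        for pred in pred_boxes
    )
    return tp / len(pred_boxes)

def mean_average_precision(preds_per_class, gts_per_class, thresholds=(0.5, 0.6, 0.7)):
    # AP per class = precision averaged over IoU thresholds; mAP = mean of the per-class APs
    aps = []
    for cls, gt_boxes in gts_per_class.items():
        aps.append(np.mean([
            precision_at(preds_per_class.get(cls, []), gt_boxes, t) for t in thresholds
        ]))
    return float(np.mean(aps))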

R-CNN (Region-Based CNN)

Steps
Generate region proposals from the image (with a lot of redundancy, since we want to avoid false negatives)
Warp each region into an image of fixed size
Forward the region proposals through a pre-trained network (ResNet50, VGG16) and extract features from a fully connected layer
Create data where the inputs are the extracted features and the outputs are the class of each region proposal and the bbox offset from the ground truth
Two outputs: a bbox regressor and a label classifier
Define a loss that minimizes both the classification error and the bbox offset error, then back-propagate it

Implementation

We begin by downloading a Kaggle subset of the Google Open Images project, containing only trucks and buses
Python
from google.colab import files  # assuming the notebook runs on Colab

files.upload()
!mkdir ~/.kaggle
!mv kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d sixhky/open-images-bus-trucks/
!unzip -qq open-images-bus-trucks.zip
Again, you'll need to upload your kaggle.json API key; you can find it on your Kaggle profile
We load the dataset. Each row is an object, i.e. a box with a label; one image can have several objects
Python
Copy
df = pd.read_csv("df.csv")
We then define a first Dataset class to extract the image, boxes and labels from the dataframe
We index the dataset by unique image names
Then we filter the dataframe for each image name: each row is an object of the given image
We return the image and its associated list of labels and boxes
Python
class ImageDataset(Dataset):
    def __init__(self, path, df):
        self.path = path
        self.df = df
        self.images = df.ImageID.unique()

    def __getitem__(self, idx):
        image_id = self.images[idx]
        filename = os.path.join(self.path, image_id + ".jpg")
        img = cv2.imread(filename)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]
        df_image = self.df.loc[self.df.ImageID == image_id]
        cols = df_image.columns
        col2idx = dict(zip(cols, range(len(cols))))
        bboxes, labels = [], []
        for row in df_image.values:
            label = row[col2idx["LabelName"]]
            bbox = [
                int(row[col2idx["XMin"]] * w),
                int(row[col2idx["YMin"]] * h),
                int(row[col2idx["XMax"]] * w),
                int(row[col2idx["YMax"]] * h),
            ]
            labels.append(label)
            bboxes.append(bbox)
        return img, np.array(labels), np.array(bboxes), filename

    def __len__(self):
        return len(self.images)

    def load_img(self, idx):
        img, labels, bboxes, _ = self[idx]
        for label, (x1, y1, x2, y2) in zip(labels, bboxes):
            img = cv2.rectangle(
                img, (x1, y1), (x2, y2), (0, 255, 0), thickness=2
            )
            img = cv2.putText(
                img, label, (x1, y1), cv2.FONT_HERSHEY_SIMPLEX,
                .5, (255, 255, 255), 1
            )
        plt.imshow(img)
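The preprocessing loop below iterates over a ds_img instance, so we create one here; a short sketch, where the image folder path is an assumption to adjust to wherever the archive was unzipped:
Python
ds_img = ImageDataset("images/images", df)  # path is an assumption; point it at the unzipped image folder
ds_img.load_img(0)                          # sanity check: show the first image with its ground-truth boxes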
We define two key functions for preprocessing:
extract_regions leverages selective search to fetch region proposals for a given image
compute_iou returns the intersection over union (IoU) score of two bboxes
Python
def extract_regions(img):
    _, regions = selectivesearch.selective_search(img, scale=200, min_size=100)
    img_area = np.prod(img.shape[:2])
    seen, candidates = [], []
    for region in regions:
        if (
            not region["rect"] in seen
            and region["size"] >= (img_area * 0.05)
            and region["size"] <= img_area
        ):
            x, y, w, h = region["rect"]
            seen.append(region["rect"])  # keep the raw rect so the duplicate check above matches
            candidates.append([x, y, x+w, y+h])
    return np.array(candidates)


def compute_iou(bbox1, bbox2):
    """
    bbox: (x1, y1, x2, y2), x1 < x2 and y1 < y2
    """
    eps = 1e-5
    max_x1 = max(bbox1[0], bbox2[0])
    max_y1 = max(bbox1[1], bbox2[1])
    min_x2 = min(bbox1[2], bbox2[2])
    min_y2 = min(bbox1[3], bbox2[3])
    width = (min_x2 - max_x1)
    height = (min_y2 - max_y1)
    if width < 0 or height < 0:
        return 0.0
    area_within = height * width
    area_bbox1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    area_bbox2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    area_total = area_bbox1 + area_bbox2 - area_within
    return area_within / (area_total + eps)
It's now time to create region proposals and compute deltas from the ground truth boxes
For a single image, we compute the IoU between each object and each region proposal box (aka "candidate")
For each candidate, we associate the closest ground truth box using IoU. If the best IoU is below 0.3, the candidate is considered background
The deltas of a candidate are the differences between the ground truth box coordinates and the candidate box coordinates
We normalize region proposals and deltas by the image dimensions (H x W)
To lighten the process, we only use 500 images
Python
preprocess = {
    "paths": [], "classes": [], "gtbbs": [],
    "rois": [], "deltas": [], "ious": [],
}
N = 500
for idx, (img, labels, gtbbs, filename) in tqdm(enumerate(ds_img), total=N):
    if idx == N:
        break
    H, W = img.shape[:2]
    candidates = extract_regions(img)
    candidates_ious = []
    for bbox in gtbbs:
        candidates_ious.append(
            [compute_iou(bbox, candidate) for candidate in candidates]
        )
    candidates_ious = np.array(candidates_ious).T  # row_idx: candidate, col_idx: bbox
    classes, rois, deltas, ious = [], [], [], []
    for jdx, candidate in enumerate(candidates):
        candidate_ious = candidates_ious[jdx]
        best_ious_idx = np.argmax(candidate_ious)  # bbox_idx
        best_ious = candidate_ious[best_ious_idx]
        if best_ious > 0.3:
            candidate_clss = labels[best_ious_idx]
        else:
            candidate_clss = "background"
        bx, by, bX, bY = gtbbs[best_ious_idx]
        cx, cy, cX, cY = candidate
        delta = np.array([
            (bx - cx), (by - cy), (bX - cX), (bY - cY)
        ])
        norm = np.array([W, H, W, H])
        classes.append(candidate_clss)
        rois.append(candidate/norm)
        deltas.append(delta/norm)
        ious.append(best_ious)
    preprocess["paths"].append(filename)
    preprocess["gtbbs"].append(gtbbs)
    preprocess["rois"].append(rois)
    preprocess["classes"].append(classes)
    preprocess["deltas"].append(deltas)
    preprocess["ious"].append(ious)
Simple label mapping
Python
label2idx = {'Bus': 0, 'Truck': 2, 'background': 1}
idx2label = {0: 'Bus', 1: 'background', 2: 'Truck'}
We then define the second Dataset class that will fetch data directly used by our model
__getitem__ loads the image a second time, converts colors from BGR to RGB (both [..., ::-1] and cvtColor(img, cv2.COLOR_BGR2RGB) have the same effect) and generates a list of crops using the ROIs from selective search
collate_fn resizes all crops to the same size, divides pixel values by 255, permutes channels (224, 224, 3) → (3, 224, 224), and normalizes with the VGG16 pre-training mean and std.
Python
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

def preprocess_img(img):
    img = torch.tensor(img).permute(2, 0, 1)
    img = normalize(img)
    return img.to(device).float()

class RCNNDataset(Dataset):
    def __init__(self, paths, gtbbs, rois, classes, deltas, ious):
        self.paths = paths
        self.gtbbs = gtbbs
        self.rois = rois
        self.classes = classes
        self.deltas = deltas
        self.ious = ious

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        fpath = self.paths[idx]
        img = cv2.imread(fpath, 1)[..., ::-1]
        #img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        H, W = img.shape[:2]
        rois = np.array(self.rois[idx])
        sh = np.array([W, H, W, H])
        bboxes = (rois * sh).astype(np.uint16)
        crops = [img[y1:y2, x1:x2] for (x1, y1, x2, y2) in bboxes]
        labels = np.array(self.classes[idx])
        deltas = np.array(self.deltas[idx])
        #gtbbs = self.gtbbs[idx]
        return crops, labels, deltas

    def collate_fn(self, batch):
        inputs, labels, deltas = [], [], []
        for crops, img_labels, img_deltas in batch:
            crops = [cv2.resize(crop, (224, 224)) for crop in crops]
            crops = [preprocess_img(crop / 255.)[None] for crop in crops]
            inputs.extend(crops)
            labels.extend([label2idx[label] for label in img_labels])
            deltas.extend(img_deltas)
        inputs = torch.cat(inputs).to(device)
        labels = torch.tensor(labels).long().to(device)
        deltas = torch.tensor(deltas).float().to(device)
        return inputs, labels, deltas
We can now create our dataloaders
We manually split into train (90%) and validation (10%) sets
We then instantiate our datasets and dataloaders with a batch size of only 2 since there are close to 40 crops per image
Python
def get_data(preprocess):
    val_idx = int(0.9 * N)
    path_train, path_val = preprocess["paths"][:val_idx], preprocess["paths"][val_idx:]
    gtbbs_train, gtbbs_val = preprocess["gtbbs"][:val_idx], preprocess["gtbbs"][val_idx:]
    rois_train, rois_val = preprocess["rois"][:val_idx], preprocess["rois"][val_idx:]
    classes_train, classes_val = preprocess["classes"][:val_idx], preprocess["classes"][val_idx:]
    deltas_train, deltas_val = preprocess["deltas"][:val_idx], preprocess["deltas"][val_idx:]
    ious_train, ious_val = preprocess["ious"][:val_idx], preprocess["ious"][val_idx:]
    ds_train = RCNNDataset(
        path_train, gtbbs_train, rois_train, classes_train, deltas_train, ious_train
    )
    ds_val = RCNNDataset(
        path_val, gtbbs_val, rois_val, classes_val, deltas_val, ious_val
    )
    print(len(ds_train), len(ds_val))
    dl_train = DataLoader(
        ds_train, batch_size=2, collate_fn=ds_train.collate_fn, drop_last=True
    )
    dl_val = DataLoader(
        ds_val, batch_size=2, collate_fn=ds_val.collate_fn, drop_last=True
    )
    return ds_train, ds_val, dl_train, dl_val
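get_data is then called on the preprocessed dict to build the loaders used below; a one-line usage sketch:
Python
ds_train, ds_val, dl_train, dl_val = get_data(preprocess)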
Define the model class
get_vgg_backbone downloads the pretrained VGG16 checkpoint and loads it into the architecture
We overwrite the classifier with an empty Sequential module and freeze the weights
RCNN adds two heads on top of the backbone: a regressor for bbox offsets and a classifier for labels
The total loss is computed as cls_loss + reg_loss * lambda with lambda = 10
lambda balances the two objectives: the normalized offsets give a small L1 loss, so it is scaled up to keep the regression signal from being drowned out by the classification loss (10 is an empirical choice)
Python
def get_vgg_backbone():
    vgg_backbone = models.vgg16(pretrained=True)
    in_features = list(vgg_backbone.classifier.children())[0].in_features
    vgg_backbone.classifier = nn.Sequential()
    for param in vgg_backbone.parameters():
        param.requires_grad = False
    return vgg_backbone, in_features


class RCNN(nn.Module):
    def __init__(self):
        super().__init__()
        vgg_backbone, in_features = get_vgg_backbone()
        self.backbone = vgg_backbone
        self.bbox = nn.Sequential(
            nn.Linear(in_features, 512),
            nn.ReLU(),
            nn.Linear(512, 4),
            nn.Tanh()
        )
        self.cls_score = nn.Sequential(nn.Linear(in_features, 3))
        self.loss_bbox = nn.L1Loss()
        self.loss_cls = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.backbone(x)
        bbox = self.bbox(x)
        cls = self.cls_score(x)
        return bbox, cls

    def calc_loss(self, deltas_hat, deltas, labels_hat, labels):
        lambda_reg = 10
        loss_labels = self.loss_cls(labels_hat, labels)
        idxs, = torch.where(labels != 1)  # regress offsets only for non-background crops (1 = background)
        if len(idxs) > 0:
            loss_bbox = self.loss_bbox(deltas_hat[idxs], deltas[idxs])
            loss_total = loss_labels + lambda_reg * loss_bbox
            return loss_total, loss_bbox.item(), loss_labels.item()
        else:
            loss_bbox = 0
            loss_total = loss_labels
            return loss_total, loss_bbox, loss_labels.item()
We then train the model and back-propagate the total loss, as seen in previous chapters (a minimal training-loop sketch follows)
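A minimal sketch of such a training loop, under stated assumptions: the optimizer, learning rate and number of epochs are arbitrary choices, and dl_train comes from get_data above:
Python
rcnn = RCNN().to(device)
optimizer = torch.optim.SGD(rcnn.parameters(), lr=1e-3)  # optimizer and lr are assumptions

def train_epoch(model, dl, optimizer):
    model.train()
    for inputs, labels, deltas in dl:
        optimizer.zero_grad()
        deltas_hat, labels_hat = model(inputs)
        loss, loss_bbox, loss_cls = model.calc_loss(deltas_hat, deltas, labels_hat, labels)
        loss.backward()
        optimizer.step()

for epoch in range(5):  # a handful of epochs is enough for this small subset
    train_epoch(rcnn, dl_train, optimizer)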
Inference time! Let's try to detect objects in a test image, reusing components from training
Python
def get_inputs(img):
    candidates = extract_regions(img)
    inputs = []
    for (x1, y1, x2, y2) in candidates:
        crop = img[y1:y2, x1:x2]  # crop each region proposal from the image
        crop = cv2.resize(crop, (224, 224))
        input = preprocess_img(crop / 255.)[None]
        inputs.append(input)
    return torch.cat(inputs).to(device), candidates


@torch.no_grad()
def predict(model, inputs):
    model.eval()
    deltas_hat, probs = model(inputs)
    probs = nn.functional.softmax(probs, -1)
    confs, labels_hat = torch.max(probs, -1)
    return [
        t.detach().float().cpu().numpy()
        for t in [deltas_hat, confs, labels_hat]
    ]


def generate_bboxes(deltas_hat, confs, labels_hat, candidates):
    idxs, = np.where(labels_hat != 1)  # drop background predictions
    deltas_hat_ = deltas_hat[idxs]
    labels_hat_ = labels_hat[idxs]
    candidates_ = candidates[idxs]
    confs_ = confs[idxs]
    bboxes_hat_ = (deltas_hat_ + candidates_).astype(np.uint16)
    idxs = ops.nms(
        torch.tensor(bboxes_hat_.astype(np.float32)),
        torch.tensor(confs_),
        iou_threshold=0.05
    )
    bboxes_hat_ = bboxes_hat_[idxs]
    labels_hat_ = labels_hat_[idxs]
    confs_ = confs_[idxs]
    if len(idxs) == 1:
        bboxes_hat_ = bboxes_hat_[None]
        labels_hat_ = labels_hat_[None]
        confs_ = confs_[None]
    return bboxes_hat_, labels_hat_, confs_


def inference(filename, model):
    img = cv2.imread(filename)[..., ::-1]
    inputs, candidates = get_inputs(img)
    deltas_hat, confs_hat, labels_hat = predict(model, inputs)
    bboxes, labels, confs = generate_bboxes(deltas_hat, confs_hat, labels_hat, candidates)
    img_copy = img.copy()
    for (x1, y1, x2, y2), label, conf in zip(bboxes, labels, confs):
        img_copy = cv2.rectangle(img_copy, (x1, y1), (x2, y2), (0, 255, 0), 2)
    plt.imshow(img_copy); plt.title("Inference");

Additional resources

Selective search
Select Search CS231b
Non max suppression
R-CNN paper