Overview
Object detection comes in handy when there are multiple classes to detect in a single image, or when the object is much smaller than the image
Two outputs: a class and a bounding box for each detected object
Steps to train a detection model
Create a dataset of images, with the position and label of every bounding box
Scan the images to get regions of interest that may contain objects
Selective search and anchor boxes are two common techniques for region proposals
Make a prediction for each region, and compute its Intersection over Union (IoU) with the ground truth to assign a class
Compute the bounding box offsets to correct the region proposals found in step 2
Train a model using the classes and bounding box offsets as targets
Measure the accuracy of the detection using mean average precision (mAP)
Intersection over Union (IoU)
IoU = 0 when there is no overlap between the prediction and the ground truth
IoU = 1 when the two boxes coincide exactly
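For reference, the two limiting cases above follow from the usual definition of IoU. The symbols are chosen here for illustration: $A$ is the predicted box, $B$ the ground-truth box, and $|\cdot|$ denotes area.

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```

With no overlap the numerator is 0, and when the boxes coincide the intersection and the union are the same area, which gives 1.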
Creating a detection dataset
Install the CVAT annotation tool by cloning the project:
cvat
Choose your installation method; I recommend using Docker
Entire workflow
Shell
$ git clone https://github.com/openvinotoolkit/cvat.git
$ cd cvat
$ docker-compose up -d
$ docker exec -it cvat bash -ic 'python3 ~/manage.py createsuperuser'
Next, open your browser and connect to http://localhost:8080/
Here I already have a project created, but you won't
Click "Create a new task", and specify the labels that you want to detect
Follow the workflow, including uploading images
Start labelling
Click on the left rectangular shape, choose your label and draw the bounding box around the object
For detection tasks, we can only draw rectangles, even if the fit seems imperfect
Later on when studying image segmentation, we will fit the mask to the shape we want to detect
You can add several boxes on the same image
Once you have labelled the image, click on the ">" button at the top to display the next one
Click on "Save" from time to time to save your progress
When you have labelled all your data, export the dataset by clicking on "Menu" > "Export task"
Choose the dataset format depending on the model
Pascal VOC is a frequent choice when working with NVIDIA models, for example
Dataset folder is made of:
labelmap.txt
Plain Text
# label:color_rgb:parts:actions
background:0,0,0::
tape:128,0,0::
Here I only have one class to detect for the entire dataset, "tape"
"background" is always added; it is the class assigned when no object is present
One xml file is generated for each image
XML
<annotation>
<folder>IMG</folder>
<filename>IMG_20211210_154131.jpg</filename>
<source>
<database>Unknown</database>
<annotation>Unknown</annotation>
<image>Unknown</image>
</source>
<size>
<width>4608</width>
<height>3456</height>
<depth></depth>
</size>
<segmented>0</segmented>
<object>
<name>tape</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>1240.67</xmin>
<ymin>1368.33</ymin>
<xmax>4301.7</xmax>
<ymax>1809.15</ymax>
</bndbox>
</object>
</annotation>
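To check an export before training, the annotation files can be read back with the standard library. The sketch below is only an illustration: `parse_voc_annotation` is a hypothetical helper name, and `sample` is a trimmed copy of the annotation shown above.

```python
import xml.etree.ElementTree as ET


def parse_voc_annotation(xml_text):
    """Return (filename, (width, height), [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    objects = []
    # One <object> element per labelled box
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((obj.findtext("name"), *coords))
    return filename, (width, height), objects


# Trimmed copy of the annotation above (hypothetical helper input)
sample = """
<annotation>
  <filename>IMG_20211210_154131.jpg</filename>
  <size><width>4608</width><height>3456</height><depth></depth></size>
  <object>
    <name>tape</name>
    <bndbox>
      <xmin>1240.67</xmin><ymin>1368.33</ymin>
      <xmax>4301.7</xmax><ymax>1809.15</ymax>
    </bndbox>
  </object>
</annotation>
"""

filename, size, objects = parse_voc_annotation(sample)
print(filename, size, objects)
```

Coordinates are kept as floats because CVAT exports sub-pixel positions; round them only when a model requires integers.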
Region proposals
Goal: identify pixels with similar values and group them into clusters
Selective search over-segments the image into small patches based on pixel intensity, using a graph-based segmentation method
Over-segmentation
Loop:
Add the bounding boxes of the segments to the region proposals
Group adjacent segments based on similarity
Selective Search Algorithm
Off-the-shelf Felzenszwalb segmentation
Python
from skimage.segmentation import felzenszwalb
segments_fz = felzenszwalb(img, scale=200)
Pixels with similar values are clustered together into a region proposal
The neural network only needs to determine whether a region is an object or background
If it is an object, the region helps us compute the bounding box offset and determine its class
Complete selective search (using Felzenszwalb internally)
Python
!pip install -q selectivesearch
import numpy as np
import selectivesearch

def extract_candidates(img):
    # Each region has a "rect" (x, y, w, h) and a "size" in pixels
    img_lbl, regions = selectivesearch.selective_search(img, scale=200, min_size=100)
    img_area = np.prod(img.shape[:2])
    candidates = []
    for r in regions:
        # Skip duplicates, regions under 5% of the image, and regions as large as the image
        if (r["rect"] not in candidates
                and r["size"] > img_area * 0.05
                and r["size"] < img_area):
            candidates.append(r["rect"])
    return candidates
Non-max suppression
In the tiger image above, many boxes are overlapping
If we obtained these boxes from a detection model and each box is mapped to a score, non-max suppression (NMS) orders the boxes by confidence score.
It then removes every lower-confidence box whose IoU with the reference box is above a threshold
Otherwise, if NMS is run right after segmentation, like above, we don't have scores. So we just order the boxes along an arbitrary axis (like y2) before removing boxes by IoU in the same way.
Implementation
Python
import numpy as np

def NMS(boxes, overlapThresh=0.3):
    if len(boxes) == 0:
        return []
    x1 = boxes[:, 0]  # x coordinate of the top-left corner
    y1 = boxes[:, 1]  # y coordinate of the top-left corner
    x2 = boxes[:, 2]  # x coordinate of the bottom-right corner
    y2 = boxes[:, 3]  # y coordinate of the bottom-right corner
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)  # +1 because both the first and the last pixel count
    indices = np.arange(len(x1))
    for idx, box in enumerate(boxes):
        # Compare the current box with every other box that is still kept
        temp_indices = indices[indices != idx]
        xx1 = np.maximum(box[0], boxes[temp_indices, 0])
        yy1 = np.maximum(box[1], boxes[temp_indices, 1])
        xx2 = np.minimum(box[2], boxes[temp_indices, 2])
        yy2 = np.minimum(box[3], boxes[temp_indices, 3])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        # Fraction of each other box covered by the intersection
        overlap = (w * h) / areas[temp_indices]
        # If the current box overlaps any other box by more than the threshold, remove its index
        if np.any(overlap > overlapThresh):
            indices = indices[indices != idx]
    return boxes[indices].astype(int)
In our case, it returns a single box almost as wide as the image itself. That's because the tiger occupies the entire image.
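The implementation above has no scores to work with. For the score-based variant described at the start of this section, a minimal pure-Python sketch could look like the following. The names `iou` and `nms_with_scores`, the 0.5 threshold, and the toy boxes are all assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_with_scores(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Visit boxes from the most to the least confident
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a box already kept
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy data: the first two boxes overlap heavily, the third is far away
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms_with_scores(boxes, scores)
print(kept)  # [0, 2]
```

The second box has an IoU of about 0.68 with the first one, so it is suppressed; the third box does not overlap anything and is kept.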
Mean Average Precision
Precision is defined as $\text{Precision} = \frac{TP}{TP + FP}$
True positives (TP) are bounding boxes with the correct class and an IoU with the ground truth above some threshold
False positives (FP) are bounding boxes with the wrong class or an IoU below that threshold
If multiple boxes are identified for the same object, only one can be a TP; the rest are FPs
Two metrics
Average precision (AP): the precision averaged over various IoU thresholds
mAP: the AP averaged across all classes of the dataset
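The TP/FP rules above can be sketched as a small matching routine. This is an illustrative sketch for a single IoU threshold, not a full mAP evaluator; `precision_at_iou`, the tuple layouts, and the toy boxes are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def precision_at_iou(predictions, ground_truths, iou_threshold=0.5):
    """predictions: [(label, score, box)], ground_truths: [(label, box)]."""
    matched = set()  # ground-truth boxes already claimed by a prediction
    tp = fp = 0
    # Most confident predictions get the first chance to match
    for label, _, box in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_label, gt_box) in enumerate(ground_truths):
            if gt_label != label or idx in matched:
                continue
            value = iou(box, gt_box)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # only one prediction per object can be a TP
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if (tp + fp) else 0.0


ground_truths = [("tape", (0, 0, 100, 100))]
predictions = [
    ("tape", 0.9, (0, 0, 100, 90)),   # IoU 0.9 with the ground truth: TP
    ("tape", 0.8, (5, 5, 100, 100)),  # same object, already matched: FP
]
precision = precision_at_iou(predictions, ground_truths)
print(precision)  # 0.5
```

Repeating this for several IoU thresholds and averaging gives the AP as defined in these notes; averaging the AP over classes gives the mAP.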
R-CNN (Region-Based CNN)
Steps
Generate region proposals from the image (with a lot of redundancy, since we need to avoid false negatives)
Warp each region into an image of fixed size
Forward the region proposals through a pre-trained network (ResNet50, VGG16) and extract features from a fully connected layer
Create data where the inputs are the extracted features and the outputs are the class of each region proposal and its bbox offset from the ground truth
Two outputs: bbox regressor and label classifier
Define a loss that combines the object classification error and the bbox offset error, then back-propagate it
Implementation
We begin by downloading a Kaggle subset of the Google Open Images project that includes only trucks and buses
Python
from google.colab import files  # assumes a Google Colab notebook

files.upload()  # upload kaggle.json
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d sixhky/open-images-bus-trucks/
!unzip -qq open-images-bus-trucks.zip
Again, you'll need to upload your kaggle.json API key, which you can find on your Kaggle profile
We load the dataset. Each row is an object, i.e. a box with a label; one image can have several objects
Python
import pandas as pd

df = pd.read_csv("df.csv")
We then need to define a first Dataset class to extract image, box and label from the dataframe
We index the dataset using image name uniqueness
Then we filter the dataframe for each image name: each row is an object of the given image
We return the image and its associated list of labels and boxes
Python
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, path, df):
        self.path = path
        self.df = df
        self.images = df.ImageID.unique()  # one entry per image

    def __getitem__(self, idx):
        image_id = self.images[idx]
        filename = os.path.join(self.path, image_id + ".jpg")
        img = cv2.imread(filename)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]
        # Each row of df_image is one object of this image
        df_image = self.df.loc[self.df.ImageID == image_id]
        cols = df_image.columns
        col2idx = dict(zip(cols, range(len(cols))))
        bboxes, labels = [], []
        for row in df_image.values:
            label = row[col2idx["LabelName"]]
            # Coordinates are stored as fractions of the image size; convert to pixels
            bbox = [
                int(row[col2idx["XMin"]] * w),
                int(row[col2idx["YMin"]] * h),
                int(row[col2idx["XMax"]] * w),
                int(row[col2idx["YMax"]] * h),
            ]
            labels.append(label)
            bboxes.append(bbox)
        return img, np.array(labels), np.array(bboxes), filename

    def __len__(self):
        return len(self.images)

    def load_img(self, idx):
        # Draw every box and its label on the image, then display it
        img, labels, bboxes, _ = self[idx]
        for label, (x1, y1, x2, y2) in zip(labels, bboxes):
            img = cv2.rectangle(
                img, (x1, y1), (x2, y2), (0, 255, 0), thickness=2
            )
            img = cv2.putText(
                img, str(label), (x1, y1), cv2.FONT_HERSHEY_SIMPLEX, .5, (255, 255, 255), 1
            )
        plt.imshow(img)
We define two key functions for preprocessing:
extract_regions leverages selective search to fetch the region proposals of a given image
compute_iou returns the intersection over union (IoU score) of two bboxes
Python
def extract_regions(img):
    _, regions = selectivesearch.selective_search(img, scale=200, min_size=100)
    img_area = np.prod(img.shape[:2])
    # "rect" is a tuple, so store tuples in a set to make the duplicate check effective
    seen, candidates = set(), []
    for region in regions:
        if (
            region["rect"] not in seen and
            region["size"] >= (img_area * 0.05) and
            region["size"] <= img_area
        ):
            x, y, w, h = region["rect"]
            seen.add(region["rect"])
            # Convert (x, y, w, h) to (x1, y1, x2, y2)
            candidates.append([x, y, x+w, y+h])
    return np.array(candidates)

def compute_iou(bbox1, bbox2):
    """
    bbox: (x1, y1, x2, y2), x1 < x2 and y1 < y2
    """
    eps = 1e-5  # avoids a division by zero
    max_x1 = max(bbox1[0], bbox2[0])
    max_y1 = max(bbox1[1], bbox2[1])
    min_x2 = min(bbox1[2], bbox2[2])
    min_y2 = min(bbox1[3], bbox2[3])
    width = (min_x2 - max_x1)
    height = (min_y2 - max_y1)
    # No overlap along at least one axis
    if width < 0 or height < 0:
        return 0.0
    area_within = height * width
    area_bbox1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    area_bbox2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    area_total = area_bbox1 + area_bbox2 - area_within
    return area_within / (area_total + eps)
It's now time to create region proposals and compute deltas from the ground truth boxes
For a single image, we compute the IoU between each object and each region proposal box (aka "candidate")
For each candidate, we associate the closest ground truth box using IoU. If the IoU is not above 0.3, the candidate is considered background
The deltas of a candidate are the coordinate differences between its ground truth box and the candidate box
We normalize region proposals and deltas by the image dimensions (H x W)
To lighten the process we only use 500 images
Python
from tqdm import tqdm

preprocess = {
    "paths": [],
    "classes": [],
    "gtbbs": [],
    "rois": [],
    "deltas": [],
    "ious": [],
}
N = 500
# ds_img is an ImageDataset instance built from the dataframe above
for idx, (img, labels, gtbbs, filename) in tqdm(enumerate(ds_img), total=N):
    if idx == N:
        break
    H, W = img.shape[:2]
    candidates = extract_regions(img)
    candidates_ious = []
    for bbox in gtbbs:
        candidates_ious.append(
            [compute_iou(bbox, candidate) for candidate in candidates]
        )
    candidates_ious = np.array(candidates_ious).T  # row_idx: candidate, col_idx: bbox
    classes, rois, deltas, ious = [], [], [], []
    for jdx, candidate in enumerate(candidates):
        candidate_ious = candidates_ious[jdx]
        best_ious_idx = np.argmax(candidate_ious)  # bbox_idx
        best_ious = candidate_ious[best_ious_idx]
        if best_ious > 0.3:
            candidate_clss = labels[best_ious_idx]
        else:
            candidate_clss = "background"
        bx, by, bX, bY = gtbbs[best_ious_idx]
        cx, cy, cX, cY = candidate
        # Offset from the candidate to its closest ground truth box, in pixels
        delta = np.array([
            (bx - cx),
            (by - cy),
            (bX - cX),
            (bY - cY)
        ])
        norm = np.array([W, H, W, H])
        classes.append(candidate_clss)
        rois.append(candidate/norm)
        deltas.append(delta/norm)
        ious.append(best_ious)
    # One entry per image; each entry holds one item per candidate
    preprocess["paths"].append(filename)
    preprocess["gtbbs"].append(gtbbs)
    preprocess["rois"].append(rois)
    preprocess["classes"].append(classes)
    preprocess["deltas"].append(deltas)
    preprocess["ious"].append(ious)
Simple label encoding
Python
label2idx = {'Bus': 0, 'Truck': 2, 'background': 1}
idx2label = {0: 'Bus', 1: 'background', 2: 'Truck'}
We then define the second Dataset class that will fetch data directly used by our model
__getitem__ loads the image a second time, converts the color from BGR to RGB
(both [..., ::-1] and cvtColor(img, cv2.COLOR_BGR2RGB) have the same effect) and generates a list of crops by using the rois from selective search
collate_fn resizes all crops to the same size, divides pixel values by 255, permutes channels (224, 224, 3) → (3, 224, 224), and normalizes the inputs with the mean and std used for the pre-trained VGG16.
Python
import torch
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

def preprocess_img(img):
    # (H, W, 3) -> (3, H, W), then normalize each channel
    img = torch.tensor(img).permute(2, 0, 1)
    img = normalize(img)
    return img.to(device).float()

class RCNNDataset(Dataset):
    def __init__(self, paths, gtbbs, rois, classes, deltas, ious):
        self.paths = paths
        self.gtbbs = gtbbs
        self.rois = rois
        self.classes = classes
        self.deltas = deltas
        self.ious = ious

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        fpath = self.paths[idx]
        img = cv2.imread(fpath, 1)[..., ::-1]  # BGR -> RGB
        H, W = img.shape[:2]
        rois = np.array(self.rois[idx])
        sh = np.array([W, H, W, H])
        # rois are normalized; scale them back to pixel coordinates
        bboxes = (rois * sh).astype(np.uint16)
        crops = [img[y1:y2, x1:x2] for (x1, y1, x2, y2) in bboxes]
        labels = np.array(self.classes[idx])
        deltas = np.array(self.deltas[idx])
        return crops, labels, deltas

    def collate_fn(self, batch):
        inputs, labels, deltas = [], [], []
        for crops, img_labels, img_deltas in batch:
            crops = [cv2.resize(crop, (224, 224)) for crop in crops]
            crops = [preprocess_img(crop / 255.)[None] for crop in crops]
            inputs.extend(crops)
            labels.extend([label2idx[label] for label in img_labels])
            deltas.extend(img_deltas)
        inputs = torch.cat(inputs).to(device)
        labels = torch.tensor(labels).long().to(device)
        deltas = torch.tensor(np.array(deltas)).float().to(device)
        return inputs, labels, deltas
We can now create our dataloaders
We manually split train (90%) and test (10%) sets
We then instantiate our datasets and dataloaders with a batch size of only 2 since there are close to 40 crops per image
Python
from torch.utils.data import DataLoader

def get_data(preprocess):
    val_idx = int(0.9 * N)  # first 90% for training, last 10% for validation
    path_train, path_val = preprocess["paths"][:val_idx], preprocess["paths"][val_idx:]
    gtbbs_train, gtbbs_val = preprocess["gtbbs"][:val_idx], preprocess["gtbbs"][val_idx:]
    rois_train, rois_val = preprocess["rois"][:val_idx], preprocess["rois"][val_idx:]
    classes_train, classes_val = preprocess["classes"][:val_idx], preprocess["classes"][val_idx:]
    deltas_train, deltas_val = preprocess["deltas"][:val_idx], preprocess["deltas"][val_idx:]
    ious_train, ious_val = preprocess["ious"][:val_idx], preprocess["ious"][val_idx:]
    ds_train = RCNNDataset(
        path_train, gtbbs_train, rois_train, classes_train, deltas_train, ious_train
    )
    ds_val = RCNNDataset(
        path_val, gtbbs_val, rois_val, classes_val, deltas_val, ious_val
    )
    print(len(ds_train), len(ds_val))
    dl_train = DataLoader(
        ds_train, batch_size=2, collate_fn=ds_train.collate_fn, drop_last=True
    )
    dl_val = DataLoader(
        ds_val, batch_size=2, collate_fn=ds_val.collate_fn, drop_last=True
    )
    return ds_train, ds_val, dl_train, dl_val
Define the model class
get_vgg_backbone downloads the pretrained VGG16 checkpoint and loads it into the architecture.
We overwrite the classifier with an empty Sequential module and freeze the weights
RCNN adds 2 outputs to the backbone, one regressor for bboxes and one classifier for labels
The total loss is computed as cls_loss + reg_loss * lambda with lambda = 10
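Why 10? A plausible reason is that the deltas are normalized by the image dimensions, so the L1 regression loss is numerically small next to the cross-entropy loss, and lambda rebalances the two terms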
Python
import torch
from torch import nn
from torchvision import models

def get_vgg_backbone():
    vgg_backbone = models.vgg16(pretrained=True)
    # Size of the flattened features that used to feed the original classifier
    in_features = list(vgg_backbone.classifier.children())[0].in_features
    vgg_backbone.classifier = nn.Sequential()
    # Freeze the backbone; only the two heads are trained
    for param in vgg_backbone.parameters():
        param.requires_grad = False
    return vgg_backbone, in_features

class RCNN(nn.Module):
    def __init__(self):
        super().__init__()
        vgg_backbone, in_features = get_vgg_backbone()
        self.backbone = vgg_backbone
        # Regression head: 4 normalized deltas in [-1, 1]
        self.bbox = nn.Sequential(
            nn.Linear(in_features, 512),
            nn.ReLU(),
            nn.Linear(512, 4),
            nn.Tanh()
        )
        # Classification head: Bus, background, Truck
        self.cls_score = nn.Sequential(nn.Linear(in_features, 3))
        self.loss_bbox = nn.L1Loss()
        self.loss_cls = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.backbone(x)
        bbox = self.bbox(x)
        cls = self.cls_score(x)
        return bbox, cls

    def calc_loss(self, deltas_hat, deltas, labels_hat, labels):
        lambda_reg = 10
        loss_labels = self.loss_cls(labels_hat, labels)
        # The regression loss only applies to non-background proposals (label 1 is background)
        idxs, = torch.where(labels != 1)
        if len(idxs) > 0:
            loss_bbox = self.loss_bbox(deltas_hat[idxs], deltas[idxs])
            loss_total = loss_labels + lambda_reg * loss_bbox
            return loss_total, loss_bbox.item(), loss_labels.item()
        else:
            loss_bbox = 0
            loss_total = loss_labels
            return loss_total, loss_bbox, loss_labels.item()
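Written out, calc_loss above computes the following. The symbols are introduced here for illustration: $F$ is the set of proposals in the batch whose label is not background, $\delta_i$ the target deltas of proposal $i$, and $\hat{\delta}_i$ the predicted ones.

```latex
\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda \, \mathcal{L}_{\text{reg}},
\qquad \lambda = 10,
\qquad
\mathcal{L}_{\text{reg}} =
\begin{cases}
\dfrac{1}{4\,|F|} \displaystyle\sum_{i \in F} \lVert \hat{\delta}_i - \delta_i \rVert_1 & \text{if } F \neq \emptyset \\[2ex]
0 & \text{otherwise}
\end{cases}
```

Here $\mathcal{L}_{\text{cls}}$ is the cross-entropy over all proposals of the batch, and the factor $1/(4|F|)$ reflects the default mean reduction of nn.L1Loss over the four coordinates of each foreground proposal.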
We then train the model and back-propagate the total loss as seen in previous chapters
Inference time! Let's try to detect objects within a test image by reusing components from the training
Python
from torchvision import ops

def get_inputs(img):
    candidates = extract_regions(img)
    inputs = []
    for (x1, y1, x2, y2) in candidates:
        crop = img[y1:y2, x1:x2]
        crop = cv2.resize(crop, (224, 224))
        inputs.append(preprocess_img(crop / 255.)[None])
    return torch.cat(inputs).to(device), candidates

@torch.no_grad()
def predict(model, inputs):
    model.eval()
    deltas_hat, probs = model(inputs)
    probs = nn.functional.softmax(probs, -1)
    confs, labels_hat = torch.max(probs, -1)
    return [
        t.detach().float().cpu().numpy() for t in [deltas_hat, confs, labels_hat]
    ]

def generate_bboxes(deltas_hat, confs, labels_hat, candidates, img_shape):
    H, W = img_shape[:2]
    # Keep only the candidates that are not predicted as background
    idxs, = np.where(labels_hat != 1)
    deltas_hat_ = deltas_hat[idxs]
    labels_hat_ = labels_hat[idxs]
    candidates_ = candidates[idxs]
    confs_ = confs[idxs]
    # Deltas were normalized by the image size during preprocessing: scale them back to pixels
    bboxes_hat_ = candidates_ + deltas_hat_ * np.array([W, H, W, H])
    bboxes_hat_ = np.clip(bboxes_hat_, 0, [W, H, W, H]).astype(np.uint16)
    keep = ops.nms(
        torch.tensor(bboxes_hat_.astype(np.float32)),
        torch.tensor(confs_),
        iou_threshold=0.05
    ).numpy()  # indexing with a NumPy array keeps the dimensions even for a single box
    return bboxes_hat_[keep], labels_hat_[keep], confs_[keep]

def inference(filename, model):
    img = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)
    img_copy = img.copy()  # draw on a copy so the input image stays untouched
    inputs, candidates = get_inputs(img)
    deltas_hat, confs_hat, labels_hat = predict(model, inputs)
    bboxes, labels, confs = generate_bboxes(
        deltas_hat, confs_hat, labels_hat, candidates, img.shape
    )
    for (x1, y1, x2, y2), label, conf in zip(bboxes, labels, confs):
        cv2.rectangle(img_copy, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
    plt.imshow(img_copy)
    plt.title("Inference")
Additional resources
Selective search
Select Search CS231b
Non max suppression
R-CNN paper