The issue with vanilla NN
We use np.roll to shift the pixels of a Trouser image from left to right, so that the object isn't centered anymore
Beyond a shift of 2 pixels, the probability assigned to the correct class drops significantly, so we can't rely on our previous model to generalize to translated images
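A minimal sketch of the translation test, assuming a 28x28 grayscale image stored as a NumPy array (as in Fashion-MNIST); the blob used here is a made-up stand-in for a real Trouser image:

```python
import numpy as np

# A crude centered "trouser-like" blob on a 28x28 canvas
image = np.zeros((28, 28))
image[5:23, 10:18] = 1.0

# Shift every pixel 2 columns to the right; pixels falling off the
# right edge wrap around to the left
shifted = np.roll(image, shift=2, axis=1)

print(shifted.shape)  # (28, 28) -- same shape, object no longer centered
```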
Building blocks of CNN
Convolution
A CNN filter is a matrix of weights, initialized randomly at first
Different filters detect different patterns (features) in the image, and are activated when their pattern is present
If we convolve a 4x4 grayscale image with 10 different 2x2 filters, the output shape will be 3x3x10: there are as many output channels as filters
If we use a color image with 3 channels, e.g. 28x28x3, each filter will also have 3 channels
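The shapes above can be checked directly with nn.Conv2d (filter values are random at initialization):

```python
import torch
import torch.nn as nn

# 10 filters of size 2x2 over a single-channel 4x4 image
conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=2)
x = torch.randn(1, 1, 4, 4)  # one 4x4 grayscale image
print(conv(x).shape)         # torch.Size([1, 10, 3, 3])

# With a 3-channel color image, each filter also has 3 channels
conv_rgb = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=2)
print(conv_rgb.weight.shape)  # torch.Size([10, 3, 2, 2])
```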
Padding
Add an external border of zeros to maintain the image size after convolving
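For example, with a 3x3 kernel, a padding of 1 preserves the spatial size:

```python
import torch
import torch.nn as nn

# padding=1 adds a one-pixel border of zeros on each side
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
x = torch.randn(1, 1, 28, 28)
print(conv(x).shape)  # torch.Size([1, 1, 28, 28]) -- size unchanged
```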
Pooling
Aggregates each region of the input into a single value, producing a smaller matrix; the most common operations are max, mean and sum
For example, with a 2x2 window and a stride of 2, a 4x4 input gives a 2x2 max pooling output
Pooling abstracts away the exact position within a region, making the model more robust to small changes (shifting a row of pixels to the right might not change the output)
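A concrete max pooling example on a made-up 4x4 matrix, with a 2x2 window and stride 2:

```python
import torch
import torch.nn as nn

x = torch.tensor([[ 1.,  2.,  5.,  6.],
                  [ 3.,  4.,  7.,  8.],
                  [ 9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).view(1, 1, 4, 4)

# Each non-overlapping 2x2 block is replaced by its maximum
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())
# tensor([[ 4.,  8.],
#         [12., 16.]])
```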
Flattening
Convolution and pooling help obtain an image representation with a much lower dimension than the original
This representation can then be treated as a vanilla NN input, like we did in the previous chapter
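Flattening simply reshapes the (N, C, H, W) tensor into (N, C*H*W) so it can feed a fully connected layer; the sizes below anticipate the model used later in this section:

```python
import torch
import torch.nn as nn

# A batch of 32 feature maps of shape (128, 5, 5)
x = torch.randn(32, 128, 5, 5)
flat = nn.Flatten()(x)
print(flat.shape)  # torch.Size([32, 3200]) since 128 * 5 * 5 = 3200
```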
Implementation
PyTorch expects the input tensor to have shape (N, C, H, W), with N the number of images, C the number of channels, and H, W the image dimensions
In our Dataset class, the view is:
x = x.view(-1, 1, 28, 28)
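For instance, assuming a flat batch of 64 Fashion-MNIST images of 784 pixels each:

```python
import torch

batch = torch.randn(64, 784)            # 64 flattened 28x28 images
batch = batch.view(-1, 1, 28, 28)       # -1 infers the batch dimension
print(batch.shape)                       # torch.Size([64, 1, 28, 28])
```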
Our model becomes:
nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3),
    nn.MaxPool2d(2),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3),
    nn.MaxPool2d(2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128 * 5 * 5, 256),  # 5x5 is the spatial size after two conv/pool stages
    nn.ReLU(),
    nn.Linear(256, 10),  # 10 output classes
).to(device)
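We can verify where the 128 * 5 * 5 input size of the first linear layer comes from by tracing the spatial dimensions through the conv/pool stages:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)
x = nn.Conv2d(1, 64, kernel_size=3)(x)    # 28 - 3 + 1 = 26 -> (1, 64, 26, 26)
x = nn.MaxPool2d(2)(x)                    # 26 // 2 = 13    -> (1, 64, 13, 13)
x = nn.Conv2d(64, 128, kernel_size=3)(x)  # 13 - 3 + 1 = 11 -> (1, 128, 11, 11)
x = nn.MaxPool2d(2)(x)                    # 11 // 2 = 5     -> (1, 128, 5, 5)
print(x.shape)
```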
The model's output on translated images has significantly improved
But there is still room for improvement for translations beyond 4 pixels