The issue with vanilla NN
We use np.roll to shift the pixels of a Trouser image from left to right, so that the object isn't centered anymore
Beyond a shift of 2 pixels, the probability assigned to the correct class drops significantly, so we can't rely on our previous model to generalize to translated images
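A minimal sketch of the translation test, assuming a 28x28 grayscale image stored as a NumPy array (as in Fashion-MNIST); the blob used here is a made-up stand-in for a real Trouser image:

```python
import numpy as np

# A crude centered "trouser-like" blob on a 28x28 canvas
image = np.zeros((28, 28))
image[5:23, 10:18] = 1.0

# Shift every pixel 2 columns to the right; pixels falling off the
# right edge wrap around to the left
shifted = np.roll(image, shift=2, axis=1)

print(shifted.shape)  # (28, 28) -- same shape, object no longer centered
```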
Building blocks of CNN
Convolution
A CNN filter is a matrix of weights, initialized randomly at first
Different filters detect different patterns (features) in the image, and are activated when their pattern is present
If we convolve a 4x4 grayscale image with 10 different 2x2 filters, the output shape will be 3x3x10: there are as many output channels as filters
If we use a color image with 3 channels, e.g. 28x28x3, each filter will also have 3 channels
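The shapes above can be checked directly with nn.Conv2d (filter values are random at initialization):

```python
import torch
import torch.nn as nn

# 10 filters of size 2x2 over a single-channel 4x4 image
conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=2)
x = torch.randn(1, 1, 4, 4)  # one 4x4 grayscale image
print(conv(x).shape)         # torch.Size([1, 10, 3, 3])

# With a 3-channel color image, each filter also has 3 channels
conv_rgb = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=2)
print(conv_rgb.weight.shape)  # torch.Size([10, 3, 2, 2])
```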
Padding
Add an external border of zeros to maintain the image size after convolving
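For example, with a 3x3 kernel, a padding of 1 preserves the spatial size:

```python
import torch
import torch.nn as nn

# padding=1 adds a one-pixel border of zeros on each side
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
x = torch.randn(1, 1, 28, 28)
print(conv(x).shape)  # torch.Size([1, 1, 28, 28]) -- size unchanged
```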
Pooling
Aggregates each region of the input into a single value, producing a smaller matrix; the most common operations are max, mean and sum
For example, with a 2x2 window and a stride of 2, a 4x4 input gives a 2x2 max pooling output
Pooling abstracts away the exact position within a region, making the model more robust to small changes (shifting a row of pixels to the right might not change the output)
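A concrete max pooling example on a made-up 4x4 matrix, with a 2x2 window and stride 2:

```python
import torch
import torch.nn as nn

x = torch.tensor([[ 1.,  2.,  5.,  6.],
                  [ 3.,  4.,  7.,  8.],
                  [ 9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).view(1, 1, 4, 4)

# Each non-overlapping 2x2 block is replaced by its maximum
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())
# tensor([[ 4.,  8.],
#         [12., 16.]])
```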
Flattening
Convolution and pooling help obtain an image representation with a much lower dimension than the original
This representation can then be treated as a vanilla NN input, like we did in the previous chapter
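Flattening simply reshapes the (N, C, H, W) tensor into (N, C*H*W) so it can feed a fully connected layer; the sizes below anticipate the model used later in this section:

```python
import torch
import torch.nn as nn

# A batch of 32 feature maps of shape (128, 5, 5)
x = torch.randn(32, 128, 5, 5)
flat = nn.Flatten()(x)
print(flat.shape)  # torch.Size([32, 3200]) since 128 * 5 * 5 = 3200
```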
Implementation
PyTorch expects the input tensor to have shape (N, C, H, W), with N the number of images, C the number of channels, and H, W the image dimensions
In our Dataset class, the view is:
x = x.view(-1, 1, 28, 28)
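For instance, assuming a flat batch of 64 Fashion-MNIST images of 784 pixels each:

```python
import torch

batch = torch.randn(64, 784)            # 64 flattened 28x28 images
batch = batch.view(-1, 1, 28, 28)       # -1 infers the batch dimension
print(batch.shape)                       # torch.Size([64, 1, 28, 28])
```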
Our model becomes:
nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3),
    nn.MaxPool2d(2),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3),
    nn.MaxPool2d(2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128 * 5 * 5, 256),  # 5x5 is the spatial size after two conv/pool stages
    nn.ReLU(),
    nn.Linear(256, 10),  # 10 output classes
).to(device)
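We can verify where the 128 * 5 * 5 input size of the first linear layer comes from by tracing the spatial dimensions through the conv/pool stages:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)
x = nn.Conv2d(1, 64, kernel_size=3)(x)    # 28 - 3 + 1 = 26 -> (1, 64, 26, 26)
x = nn.MaxPool2d(2)(x)                    # 26 // 2 = 13    -> (1, 64, 13, 13)
x = nn.Conv2d(64, 128, kernel_size=3)(x)  # 13 - 3 + 1 = 11 -> (1, 128, 11, 11)
x = nn.MaxPool2d(2)(x)                    # 11 // 2 = 5     -> (1, 128, 5, 5)
print(x.shape)
```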
The model's output on translated images has significantly improved
But there is still room for improvement for translations beyond 4 pixels