Chapter 5: Transfer Learning

Overview

VGG16

ResNet

Multi-regression: key facial point detection

Multi task learning: age estimation + gender classification

Google Colaboratory

https://colab.research.google.com/drive/1c5FV9RlFU-b6DiOjjhtCP-37OJxbcrRY

Overview

Transfer learning consists of fine-tuning a model that was pre-trained on a huge generic dataset, using a specific dataset of interest.

We leverage the knowledge gained on one task to solve another, similar task.

High level flow:

Normalize the input images with the same mean and standard deviation used for the pre-trained model

Fetch the architecture and the weights, and load the pre-trained model

Truncate the last few layers of the model and freeze the remaining weights, as we don’t want to train them again

Connect the truncated model to randomly initialized layers, with the output size of the last layer matching the number of classes to detect

Update the trainable weights over epochs to fit a model
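
The flow above can be sketched on a toy network. This is a minimal sketch, not the chapter's model: `backbone` is a hypothetical stand-in for a pre-trained feature extractor (randomly initialized here so that nothing is downloaded), and `n_classes` and the random batch are assumptions made for illustration.

```python
import torch
from torch import nn

n_classes = 2  # assumed number of classes for the new task

# Hypothetical stand-in for a pre-trained feature extractor.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze the transferred weights so that they are not updated.
for param in backbone.parameters():
    param.requires_grad = False

# New, randomly initialized head sized for the task of interest.
head = nn.Linear(8, n_classes)
model = nn.Sequential(backbone, head)

# Only the trainable parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One training step on a random batch, for illustration only.
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, n_classes, (4,))
frozen_before = backbone[0].weight.clone()
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, only the weight and bias of `head` have changed; the backbone is identical to `frozen_before`.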

VGG16

VGG stands for Visual Geometry Group; 16 is the number of weight layers in the model (13 convolutional and 3 fully connected)

Use torchsummary to get a clean overview of the architecture

Python

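!pip install torchsummary
import torch
from torchsummary import summary
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
# Newer torchvision releases prefer the weights= argument over pretrained=True
model = models.vgg16(pretrained=True).to(device)
# input_size=(channels, H, W); H and W are free to choose
summary(model, input_size=(3, 224, 224), device=device)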

Download cats and dogs dataset from Kaggle

You will need to create a Kaggle API token from your Kaggle account; a kaggle.json file is then downloaded automatically

Upload the kaggle.json file into colab when asked by the UI at files.upload()

Python

from google.colab import files

files.upload()  # select kaggle.json when prompted
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # the Kaggle CLI expects restricted permissions
!ls ~/.kaggle

!kaggle datasets download -d tongpython/cat-and-dog
!unzip -q cat-and-dog.zip

Create our Dataset class

Python

from glob import glob
from random import shuffle

import cv2
import torch
from torch.utils.data import Dataset
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

class CatsDogsDataset(Dataset):
    def __init__(self, folder):
        cats = glob(f"{folder}/cats/*.jpg")
        dogs = glob(f"{folder}/dogs/*.jpg")
        self.fpaths = cats[:500] + dogs[:500]  # keep 500 images per class
        shuffle(self.fpaths)
        # File names start with "dog" or "cat": True (1) for dog, False (0) for cat
        self.targets = [
            fpath.split("/")[-1].startswith("dog") for fpath in self.fpaths
        ]
        self.normalize = transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        )

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        f = self.fpaths[idx]
        target = self.targets[idx]
        img = cv2.imread(f)[:, :, ::-1]  # BGR -> RGB
        img = cv2.resize(img, (224, 224))
        img = torch.tensor(img / 255)  # scale to [0, 1]
        img = img.permute(2, 0, 1)  # (H, W, C) -> (C, H, W)
        img = self.normalize(img)
        return img.float().to(device), torch.tensor([target]).float().to(device)

targets (y) are binary: 1 for dog, 0 for cat

normalize applies the per-channel mean and standard deviation of ImageNet, the dataset on which the torchvision models were pre-trained

images need to be

converted from BGR to RGB

resized to the pre-trained network input

scaled between 0 and 1

normalized like the pre-trained network

the mean and std values are the same for all torchvision models pre-trained on ImageNet (see PyTorch Source Code)

we return both the image and its target
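
The preprocessing steps listed above can be checked on a synthetic image. This is a minimal sketch under stated assumptions: the random array `bgr` stands in for what `cv2.imread` returns, resizing is skipped because the array already has the target size, and the normalization is written out by hand instead of calling `transforms.Normalize`.

```python
import numpy as np
import torch

# ImageNet channel statistics (RGB order), as used in the Dataset class.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Synthetic stand-in for a loaded image: (H, W, C), BGR order, uint8.
bgr = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

rgb = bgr[:, :, ::-1].copy()             # BGR -> RGB (copy makes it contiguous)
x = torch.from_numpy(rgb).float() / 255  # scale to [0, 1]
x = x.permute(2, 0, 1)                   # (H, W, C) -> (C, H, W)
x = (x - mean) / std                     # per-channel normalization
print(x.shape, x.dtype)
```

The result is a float32 tensor of shape (3, 224, 224), which is what the pre-trained network expects for a single image.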

Create our model function

Python

from torch import nn, optim
from torchvision import models

def get_model():
    model = models.vgg16(pretrained=True)
    # Freeze the pre-trained weights
    for param in model.parameters():
        param.requires_grad = False
    # Layers created after the loop are trainable by default
    model.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
    model.classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(128, 1),
        nn.Sigmoid(),
    )
    loss_fn = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    return model.to(device), loss_fn, optimizer

Freeze all pre-trained parameters so that they are not updated during training, then overwrite the average pooling layer and the final classifier

Adaptive pooling is an average pooling layer with a twist: instead of defining a kernel size, we define the size of the output feature map, so that the output always has the same size and the network can accept images of any dimensions.

Ex: if the input has dimension 512 * k * k and the output size is (1, 1), the kernel size will be k * k
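
A quick way to convince yourself of this behavior is to feed feature maps of different sizes through the layer. This is a minimal sketch; the sizes 7, 10 and 16 are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))

shapes = []
for k in (7, 10, 16):
    feature_map = torch.rand(1, 512, k, k)
    out = pool(feature_map)
    # With a 1 x 1 output, the layer averages over the whole k x k map.
    expected = feature_map.mean(dim=(2, 3), keepdim=True)
    shapes.append((tuple(out.shape), torch.allclose(out, expected, atol=1e-6)))

print(shapes)
```

Every entry has shape (1, 512, 1, 1), whatever the value of k, which is why the following Linear layer can keep a fixed input size of 512.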

Most of the training script from Chapter 3 remains valid, with a few updates:

Add a threshold to get_accuracy 

Python

@torch.no_grad()
def get_accuracy(X, y, model):
  model.eval()
  y_hat = model(X)
  is_correct = (y_hat > .5) == y
  return is_correct.cpu().numpy().tolist()

We are able to get 98% accuracy 

Looking at VGG11 and VGG19, we observe respectively slightly worse and slightly better performances

However, we can’t just add layers to make the network deeper, because

Vanishing gradient will arise

More parameters to update

Too much information modification at deep layers

ResNet comes to the rescue and addresses these issues

ResNet

When building deep networks, two problems arise:

The last layers, close to the output, have no clue what the original image was

The gradients of the first layers are close to zero

Using a residual block, we can propagate the original input forward, so that the network can focus on extracting features rather than on rebuilding the input

Implementation

Python

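from torch import nn

class ResLayer(nn.Module):
    def __init__(self, n_i, n_o, kernel_size, stride=1):
        super().__init__()
        # "Same" padding for odd kernel sizes; x + conv(x) also requires
        # n_i == n_o and stride == 1 so that both tensors have the same shape
        padding = kernel_size // 2
        self.conv = nn.Sequential(
            nn.Conv2d(n_i, n_o, kernel_size, stride, padding=padding),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.conv(x) + x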
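
To see why the shortcut helps, it can be useful to look at the extreme case where the convolution contributes nothing. This is a minimal sketch: `TinyResBlock` is a hypothetical block written for this illustration, and zeroing its weights is only a way to isolate the skip connection.

```python
import torch
from torch import nn

class TinyResBlock(nn.Module):
    """Hypothetical minimal residual block: conv + ReLU, plus the input."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(
            channels, channels, kernel_size, padding=kernel_size // 2
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x)) + x

block = TinyResBlock(channels=4)
# Zero the convolution so that the learned branch outputs zeros.
nn.init.zeros_(block.conv.weight)
nn.init.zeros_(block.conv.bias)

x = torch.rand(2, 4, 8, 8, requires_grad=True)
out = block(x)
out.sum().backward()

print(torch.equal(out, x))                            # block acts as the identity
print(torch.allclose(x.grad, torch.ones_like(x)))     # gradient flows through the shortcut
```

Even when the convolutional branch is useless, the input and its gradient pass through unchanged, so stacking such blocks does not erase the original signal.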

Architecture of ResNet18

18 layers in total, with a skip connection every 2 layers

97% accuracy with only 1000 images

Other popular pre-trained models are Inception, MobileNet, DenseNet, and SqueezeNet

Multi-regression: key facial point detection

Google Colaboratory

https://colab.research.google.com/drive/1sY2QTI01ES9lMNQBgY5lmyB8WFggH2EU#scrollTo=NKSihkh9Gedn

Challenges:

Image size can vary, so we need to scale our keypoints as well

After normalization, keypoint coordinates are always between 0 and 1, so we can use sigmoid at the end of the network
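
The scaling of keypoints can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: `normalize_keypoints` and `to_pixels` are hypothetical helper names, and the keypoints are assumed to be stored as a flat array of interleaved x and y pixel coordinates.

```python
import numpy as np

def normalize_keypoints(kp_xy, width, height):
    """Flat [x1, y1, x2, y2, ...] in pixels -> [x1, x2, ..., y1, y2, ...] in [0, 1]."""
    kp_xy = np.asarray(kp_xy, dtype=float)
    return np.concatenate([kp_xy[0::2] / width, kp_xy[1::2] / height])

def to_pixels(kp, width, height):
    """Inverse step, used to plot keypoints on an image of the given size."""
    n = len(kp) // 2
    return kp[:n] * width, kp[n:] * height

# Two keypoints on a 200 x 160 image (width x height).
kp = normalize_keypoints([50, 20, 100, 80], width=200, height=160)
# Rescale the normalized keypoints to a 224 x 224 resized image.
xs, ys = to_pixels(kp, width=224, height=224)
print(kp, xs, ys)
```

Because the normalized values do not depend on the original image size, the same network output can be drawn on the resized 224 x 224 image.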

Download keypoint data

Python

import pandas as pd

!git clone https://github.com/udacity/P1_Facial_Keypoints.git
# No cd needed: the paths below are relative to the current directory
train_dir = 'P1_Facial_Keypoints/data/training/'
train_df = pd.read_csv("P1_Facial_Keypoints/data/training_frames_keypoints.csv")
train_df.head()

column “0” is keypoint_1_x, column “1” is keypoint_1_y

Dataset class

Python

import os

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset
from torchvision import transforms

# device is assumed to be defined as in the previous sections

class KeypointDataset(Dataset):
    def __init__(self, df, img_dir):
        super().__init__()
        self.img_dir = img_dir
        self.normalize = transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        )
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.img_dir, row.iloc[0])
        img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]  # original size, needed to scale the keypoints
        kp_xy = row.iloc[1:].to_numpy(dtype=float)
        kp_x = (kp_xy[0::2] / w).tolist()  # x is divided by the width
        kp_y = (kp_xy[1::2] / h).tolist()  # y is divided by the height
        kp = torch.tensor(kp_x + kp_y).float()  # [x1..x68, y1..y68]
        return self.preprocess_img(img / 255), kp.to(device)

    def preprocess_img(self, img):
        img = cv2.resize(img, (224, 224))
        img = torch.tensor(img).permute(2, 0, 1)  # (H, W, C) -> (C, H, W)
        img = self.normalize(img).float()
        return img.to(device)

    def load_img(self, idx):
        """for debug and viz purposes only"""
        img_file = self.df.iloc[idx, 0]
        img_path = os.path.join(self.img_dir, img_file)
        # Convert colors before dividing: cvtColor does not accept float64 input
        img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (224, 224))
        return img / 255

df is the dataset of image path and keypoints

All inputs need to be set as tensors

Scale the image by 255, then normalize it using the standard mean and std of pre-trained models

Create loaders

Python

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

def get_data(df, img_dir):
    train_df, test_df = train_test_split(df, test_size=0.2)

    train_dataset = KeypointDataset(train_df.reset_index(drop=True), img_dir)
    test_dataset = KeypointDataset(test_df.reset_index(drop=True), img_dir)

    train_dl = DataLoader(train_dataset, batch_size=32)
    test_dl = DataLoader(test_dataset, batch_size=32)
    return train_dl, test_dl

We split the training data into train and test subsets, so that the validation dataset stays available for later use

The model needs a few tweaks as well

Python

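import torch
from torch import nn
from torchvision import models

def get_model():
    model = models.vgg16(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False  # note the "s": require_grad would be silently ignored
    # For a 224 x 224 input, the VGG16 features are 512 x 7 x 7
    model.avgpool = nn.Sequential(
        nn.Conv2d(512, 512, 3),  # -> 512 x 5 x 5
        nn.MaxPool2d(2),  # -> 512 x 2 x 2
        nn.Flatten(),  # -> 2048
    )
    model.classifier = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(512, 136),  # 68 keypoints, x and y
        nn.Sigmoid(),
    )
    loss_fn = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    return model.to(device), loss_fn, optimizer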

avgpool is replaced by a convolution with as many output channels as input channels (512) and a kernel size of 3, followed by max pooling and a flatten operation

⇒ Add Stanford rule for dimension computing for CNNs
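
The usual rule for the output size of a convolution or pooling layer along one axis is floor((W - K + 2P) / S) + 1, with W the input size, K the kernel size, P the padding and S the stride. The sketch below assumes this is the rule meant by the note above and applies it to the replacement layers; `conv_out` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

side = 7                                    # VGG16 feature map for a 224 x 224 input
after_conv = conv_out(side, kernel=3)       # nn.Conv2d(512, 512, 3)
after_pool = conv_out(after_conv, kernel=2, stride=2)  # nn.MaxPool2d(2)
flat = 512 * after_pool * after_pool        # size seen by the first Linear layer
print(after_conv, after_pool, flat)         # 5 2 2048
```

This is where the 2048 input features of the first Linear layer come from.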

The training procedure stays the same

Inference

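Python

import matplotlib.pyplot as plt

ix = 0
test_dataset = test_dl.dataset  # the dataset wrapped by the test DataLoader
im = test_dataset.load_img(ix)
x, _ = test_dataset[ix]
model.eval()  # disable dropout for inference
kp = model(x[None]).flatten().detach().cpu()

plt.figure(figsize=(10,10))
plt.subplot(221)
plt.title('Original image')
plt.imshow(im)
plt.grid(False)

plt.subplot(222)
plt.title('Image with facial keypoints')
plt.imshow(im)
plt.scatter(kp[:68]*224, kp[68:]*224, c='r')  # first 68 values are x, last 68 are y
plt.grid(False)
plt.show()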

x[None] adds one dimension to the image, simulating a batch with a single element

keypoints need to be rescaled to the image width and height

detach removes the tensor from the gradient graph
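
Both operations are easy to check on small tensors. This is a minimal sketch; the tensor `w` is a hypothetical trainable weight introduced only so that autograd has something to track.

```python
import torch

x = torch.rand(3, 224, 224)   # a single image, (C, H, W)
batch = x[None]               # same result as x.unsqueeze(0)
print(batch.shape)            # torch.Size([1, 3, 224, 224])

w = torch.rand(3, requires_grad=True)
y = (batch.mean(dim=(2, 3)) * w).sum()  # tracked by autograd because of w
y_plain = y.detach()                    # same value, no gradient history
print(y.requires_grad, y_plain.requires_grad)  # True False
```

Calling `.numpy()` on a tensor that still requires gradients raises an error, which is why detach comes before any conversion for plotting.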

check the state of the art in face pose estimation

https://github.com/1adrianb/face-alignment/blob/master/face_alignment/detection/blazeface/net_blazeface.py

https://www.adrianbulat.com/downloads/FG20/fast_human_pose.pdf

Multi task learning: age estimation + gender classification

Google Colaboratory

https://colab.research.google.com/drive/1OCBMT_edF0dtik3ABD7oGurfAGQ7l621#scrollTo=cFWuVvESz9UY

How can we predict two different attributes of the same image at the same time?

Our new plan is to

Use a pre-trained model, freeze all its layers except the last

Split the last layer into two branches, and use a continuous loss for age and a binary cross-entropy loss for gender

Add the two losses and backpropagate

Dataset

Python

import cv2
import torch
from torch.utils.data import Dataset
from torchvision import transforms

class AgeGenderDataset(Dataset):
    def __init__(self, df):
        super().__init__()
        self.df = df
        self.normalize = transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        )

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        age = row.age
        gender = row.gender == "Male"
        f = row.file
        img = cv2.imread(f)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        return img, age, gender

    def collate_fn(self, batch):
        # batch is a list of (img, age, gender) tuples returned by __getitem__
        list_img, list_age, list_gender = [], [], []
        for img, age, gender in batch:
            img = self.img_preprocess(img)
            age = float(age / 80)  # scale age to roughly [0, 1]
            gender = float(gender)
            list_img.append(img)
            list_age.append(age)
            list_gender.append(gender)
        img = torch.cat(list_img).to(device)
        age = torch.tensor(list_age).to(device).float()
        gender = torch.tensor(list_gender).to(device).float()
        return img, age, gender

    def img_preprocess(self, img):
        img = cv2.resize(img, (224, 224))
        img = torch.tensor(img).permute(2, 0, 1)  # (H, W, C) -> (C, H, W)
        img = self.normalize(img / 255)
        return img[None]  # (1, C, H, W)

__getitem__ returns feature img and targets age, gender

all preprocessing is done through the collate_fn, called by the DataLoader, with the data processed as batch, instead of individually through __getitem__

img_preprocess needs to move the channel dimension first, scale by 255, normalize with the pre-trained mean and std, and add a dimension to simulate a batch: img[None].

Preprocessed images have dimension (1, C, H, W) and their list is then concatenated, so that torch tensor has dimension (N, C, H, W)
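
The role of collate_fn can be illustrated without any image file. This is a minimal sketch under stated assumptions: `ToyDataset` is a hypothetical dataset returning random images of varying size, and `interpolate` is used as a stand-in for `cv2.resize`.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset returning raw (H, W, C) images of varying size."""

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        side = 8 + idx  # every sample has a different size
        return torch.rand(side, side, 3), float(idx)

    def collate_fn(self, batch):
        # batch is a list of whatever __getitem__ returned
        imgs, targets = [], []
        for img, target in batch:
            img = img.permute(2, 0, 1)[None]  # (1, C, H, W)
            img = torch.nn.functional.interpolate(img, size=(16, 16))
            imgs.append(img)
            targets.append(target)
        return torch.cat(imgs), torch.tensor(targets)

ds = ToyDataset()
dl = DataLoader(ds, batch_size=4, collate_fn=ds.collate_fn)
imgs, targets = next(iter(dl))
print(imgs.shape, targets.shape)
```

Without a custom collate_fn, the default collation would fail here, because images of different sizes cannot be stacked into one tensor.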

DataLoader

Python

train_ds = AgeGenderDataset(train_df)
val_ds = AgeGenderDataset(val_df)

train_dl = DataLoader(
    train_ds, batch_size=32, shuffle=True, collate_fn=train_ds.collate_fn
)
val_dl = DataLoader(
    val_ds, batch_size=32, shuffle=True, collate_fn=val_ds.collate_fn
)

DataLoader calls collate_fn on every batch; it is defined as a method of the dataset class for convenience.

check your implementation with

Python

a,b,c, = next(iter(train_dl))
print(a.shape, b.shape, c.shape)
# torch.Size([32, 3, 224, 224]) torch.Size([32]) torch.Size([32])

Model

Python

import torch
from torch import nn
from torchvision import models

def get_model():
    model = models.vgg16(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.avgpool = nn.Sequential(
        nn.Conv2d(512, 512, 3),
        nn.MaxPool2d(2),
        nn.ReLU(),
        nn.Flatten(),
    )
    model.classifier = AgeGenderClassifier()
    loss_gender = nn.BCELoss()
    loss_age = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    return model.to(device), (loss_age, loss_gender), optimizer

Freeze all parameters again by setting requires_grad to False

Overwrite avgpool by a convolutional layer, followed by a flatten operator

Overwrite classifier with a custom age gender module

2 losses are defined: one continuous for age: L1Loss, one categorical for gender: BCELoss, and returned as a tuple
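
A pitfall worth checking when freezing layers: the attribute is spelled requires_grad, and assigning to a misspelled name such as require_grad raises no error because it just creates a new, unused attribute. The sketch below illustrates this on a small Linear layer chosen for the example; `count_trainable` is a hypothetical helper.

```python
from torch import nn

def count_trainable(model):
    """Number of scalar parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = nn.Linear(10, 2)  # 10 * 2 weights + 2 biases = 22 parameters

for param in model.parameters():
    param.require_grad = False  # typo: nothing is frozen
after_typo = count_trainable(model)

for param in model.parameters():
    param.requires_grad = False  # correct attribute name
after_fix = count_trainable(model)

print(after_typo, after_fix)  # 22 0
```

Counting the trainable parameters after building the model is a cheap way to confirm that the freeze actually happened.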

AgeGenderClassifier

Python

from torch import nn

class AgeGenderClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared layers used by both tasks
        self.intermediate = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
        )
        # Age is scaled to [0, 1], hence the sigmoid
        self.age_regressor = nn.Sequential(
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )
        self.gender_classifier = nn.Sequential(
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.intermediate(x)
        age = self.age_regressor(x)
        gender = self.gender_classifier(x)
        return age, gender

Contrary to the previous get_model function, no existing attribute is overwritten here. Submodule names are chosen freely and called in forward

The final layers diverge between age and gender: forward takes x as input and returns both age and gender

Training method

Python

def train_batch(data, model, loss_fns, optimizer):
    model.train()
    optimizer.zero_grad()

    img, age, gender = data
    age_pred, gender_pred = model(img)

    loss_age_fn, loss_gender_fn = loss_fns
    loss_age = loss_age_fn(age_pred.squeeze(), age)
    loss_gender = loss_gender_fn(gender_pred.squeeze(), gender)
    loss_total = loss_age + loss_gender

    loss_total.backward()
    optimizer.step()
    return loss_total.item()

Feed the age and gender loss functions with the corresponding predictions and targets

The loss that we back-propagate on is the sum of both losses
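
The effect of summing the two losses can be observed on a toy model. This is a minimal sketch, not the chapter's network: `TwoHeads` is a hypothetical model with one shared layer and two sigmoid heads, and the inputs and targets are random.

```python
import torch
from torch import nn

class TwoHeads(nn.Module):
    """Hypothetical toy model: one shared layer, two sigmoid heads."""

    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
        self.age = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
        self.gender = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())

    def forward(self, x):
        x = self.shared(x)
        return self.age(x), self.gender(x)

model = TwoHeads()
x = torch.rand(32, 16)
age = torch.rand(32)                          # ages already scaled to [0, 1]
gender = torch.randint(0, 2, (32,)).float()   # binary labels

age_pred, gender_pred = model(x)
loss_age = nn.L1Loss()(age_pred.squeeze(1), age)
loss_gender = nn.BCELoss()(gender_pred.squeeze(1), gender)
loss_total = loss_age + loss_gender
loss_total.backward()

# Both heads and the shared layer receive gradients from the summed loss.
has_grads = [
    m[0].weight.grad is not None
    for m in (model.shared, model.age, model.gender)
]
print(has_grads)
```

A single backward pass on the sum updates the shared layers with the gradients of both tasks, while each head only sees the gradient of its own loss.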

Gender accuracy is close to 84% and Age prediction is off by 6 years

Inference

Python

!wget https://www.dropbox.com/s/6kzr8l68e9kpjkf/5_9.JPG
img = cv2.imread('/content/5_9.JPG')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # same color order as during training

model.eval()  # disable dropout for inference
age, gender = model(train_ds.img_preprocess(img).to(device))
pred_gender = gender.to("cpu").detach().numpy()
pred_age = age.to("cpu").detach().numpy()

plt.imshow(img)
gender_label = "Male" if pred_gender[0][0] > 0.5 else "Female"
age_years = int(pred_age[0][0] * 80)  # undo the division by 80 applied to the targets
print(f"Predicted gender: {gender_label}, predicted age: {age_years}")

predictions must be sent to the CPU, detached (untracked for backpropagation) and turned into NumPy arrays for display purposes

My personal mistakes during implementation

Forgot to add super().__init__() to my custom module

Mistake on the shape of the batch received by collate_fn: it is a list of elements

float() must be called after to(device)

squeeze() after prediction is needed