Lab 3: Finetuning (Transfer Learning)

The main task is to fine-tune a pretrained CNN for a new classification task (transfer learning).

Skills: data loader from an image folder, data preprocessing, loading pretrained models, remote GPU servers, training part of the model. Insights: convolutional filters, error case analysis

Introduction

In this lab we start from a model already pretrained on the ImageNet classification dataset (1000 categories and 1.2 million images) and try to adjust it for solving a small-scale but otherwise challenging classification problem.

This allows us to work with a large-scale model at moderate computational expense, since our fine-tuning dataset is small.
We will see that a pretrained network has already learned powerful visual features, which will greatly simplify our task.
We will consider two fine-tuning variants: adjusting the last layer or all layers.

Setup

GPU Servers

Now is a good time to start working with the GPU servers. Check the How To page. The recommended setup is as follows:

SSH authentication with pre-shared keys
VS Code “Remote - SSH” extension
Lmod configuration loaded via the “Python Wrapper” method

Beware: VS Code tends to keep the server daemon active even after you turn off your computer. As GPU memory is expensive, log in to the server regularly and check whether your processes still occupy some GPUs. You may call pkill -f ipykernel to kill these processes.

Model

SOTA pretrained architectures are available in PyTorch. We will use the following models:

Squeezenet1_0 https://pytorch.org/hub/pytorch_vision_squeezenet/, which is a fast and small model but achieves somewhat lower accuracy on ImageNet.
ResNet-18 https://pytorch.org/hub/pytorch_vision_resnet/, which achieves better performance on ImageNet and is more stable in training.

import torchvision.models
model1 = torchvision.models.squeezenet1_0(weights=torchvision.models.SqueezeNet1_0_Weights.DEFAULT)
model2 = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

You can see the structure of the loaded model by calling print(model). You can also open the source defining the network architecture https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py. Usually it is defined as a hierarchy of Modules, where each Module is either an elementary layer (e.g. Conv2d, Linear, ReLU) or a container (e.g. Sequential).

Data

The data are placed in /local/temporary/Datasets/PACS_cartoon and in /local/temporary/Datasets/PACS_cartoon_few_shot on both servers (for faster access and to avoid multiple copies). You can also download the dataset (e.g. to use on your computer):

PACS_cartoon (84 MB)

The PACS_cartoon dataset contains color images of cartoons, 227×227 pixels each, in 7 categories:

01: Dog
02: Elephant
03: Giraffe	
04: Guitar	
05: Horse
06: House
07: Person

Template

This lab has been substantially renewed this year. Please let us know of any problems you encounter with the template or the task.

Part 1: Visualization of First Layer Filters and Features (1p)

The first task is just to load the pretrained network, apply it to a test image and visualize the convolution filters and activations in the first layer. For this task, SqueezeNet is more suitable as it has 7×7 convolution filters in the first layer. There are a couple of technicalities, which are prepared in the template.

Load the test image, transform it to the expected input of the network (type, shape, scaling)
Use the network to compute class predictive probabilities and report the top 5 classes and their probabilities.
Display weights of the first convolutional layer as images, in a grid of 8 x 12 (SqueezeNet has 96 channels in the first layer)
Apply the first linear layer of the network (the convolution) to the input image and display the resulting activation maps for the first 16 channels (e.g. as a grid of 4 x 4 images). Observe the result before and after the non-linearity. Hint: the Sequential container supports slicing, so that model.features[0:2] is a small neural network consisting of the first two layers.

Part 2: Data Preprocessing and Loaders (1p)

To address the classification task, we first need to load in the data: create dataset, split into training and validation, create loaders. Fortunately, there are convenient tools for all the steps. The respective technicalities are prepared in the template.

Create dataset for all training images. We use existing tool datasets.ImageFolder:

from torchvision import datasets, transforms
train_data = datasets.ImageFolder('/local/temporary/Datasets/PACS_cartoon/train', transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_data, batch_size=1, shuffle=True, num_workers=0)

Let us calculate the statistics of the data. Do they match those used to standardize the inputs for ImageNet? Compute the mean and standard deviation per color channel over all pixels and all images in the training set. Think about how to do it incrementally with mini-batches, without loading the whole dataset into memory at once. The values for ImageNet are:
```
mean=[0.485, 0.456, 0.406] 
std=[0.229, 0.224, 0.225]
```
Adapt the dataset using the transform 'transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=mean, std=std)])' with your computed standardization.
From the train dataset create two loaders: the loader used for optimizing the model parameters (train_loader) and the loader used for validation (val_loader). Use the sampler argument of DataLoader with SubsetRandomSampler. Use random subsets instead of just slicing the dataset: you should not assume that the dataset is randomly shuffled (and in this task it really is not).
It is important that the train/validation partition is the same for all the following parts, so using a fixed seed is suggested.
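
One possible way to accumulate the statistics and to build a reproducible split is sketched below. It assumes a small random TensorDataset in place of the ImageFolder dataset, an 80/20 split and a seed of 0; all three are arbitrary choices for illustration, not requirements of the task.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Assumption: a small random dataset stands in for the ImageFolder dataset.
images = torch.rand(40, 3, 16, 16)
labels = torch.arange(40) % 7
data = TensorDataset(images, labels)

# Per-channel statistics accumulated over mini-batches.
loader = DataLoader(data, batch_size=8, shuffle=False)
n_pixels = 0
s1 = torch.zeros(3, dtype=torch.float64)  # running sum of x
s2 = torch.zeros(3, dtype=torch.float64)  # running sum of x^2
for x, _ in loader:
    x = x.double()
    n_pixels += x.shape[0] * x.shape[2] * x.shape[3]
    s1 += x.sum(dim=(0, 2, 3))
    s2 += (x ** 2).sum(dim=(0, 2, 3))
mean = s1 / n_pixels
std = (s2 / n_pixels - mean ** 2).sqrt()  # population std: sqrt(E[x^2] - E[x]^2)

# Reproducible random train/validation split (80/20 is an arbitrary choice).
g = torch.Generator().manual_seed(0)
perm = torch.randperm(len(data), generator=g).tolist()
n_val = len(data) // 5
val_idx, train_idx = perm[:n_val], perm[n_val:]
train_loader = DataLoader(data, batch_size=8, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(data, batch_size=8, sampler=SubsetRandomSampler(val_idx))
print(mean, std, len(train_idx), len(val_idx))
```

Accumulating in float64 keeps the sums of squares accurate when the number of pixels is large. Storing train_idx and val_idx (or the seed) lets the later parts reuse exactly the same partition.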

Part 3: Training From Scratch (1p)

We will investigate the benefits of using a pre-trained network, even if the distribution of our task is different from the pre-training (e.g. train on cartoons with a network pre-trained on photographs). First let's check the performance of a model trained on cartoons from scratch:

Create ResNet-18 model with a random initialization and move it to a GPU

model = torchvision.models.resnet18(weights=None)
model.to(dev)

In your model architecture identify and delete the “classifier” part that maps “features” to scores of 1000 ImageNet classes and add a new “classifier” module that consists of one linear layer with randomly initialized weights and outputs scores for 7 classes (our dataset).
Train the network for $50$ epochs. Use the higher-level tools, an optimizer and nll_loss, as proposed in the template. When loading the data, move it to the GPU as well. Note that to(dev) is not an in-place operation for Tensors, unlike for Modules.
Choose the best learning rate and the stopping epoch by cross-validation. Select the learning rate from $0.01, 0.03, 0.001, 0.003, 0.0001$. In order to apply cross-validation, use the val_loader to evaluate the validation accuracy at the end of each training epoch. Select the model that achieves the best validation accuracy over all of the learning rates and training epochs. Save the best network using torch.save. See the Saving / Loading Tutorial.
Report the full setup of learning that you used: Base network, optimizer, learning rate, grid for hyper-parameters search and the selected hyper-parameters (including the epoch you chose to stop). Report logs (or plots) of training and validation metrics (loss, accuracy) versus epochs for the selected hyper-parameters (learning rate).

Report the final test classification accuracy of the best model (selected on the validation set). The test set is specified as a separate folder:

test_data = datasets.ImageFolder('/local/temporary/Datasets/PACS_cartoon/test', transform)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=8, shuffle=False, num_workers=0)

Use the same input transform as for training. Do not re-tune the hyperparameters to achieve a better test set performance! The network will probably make some errors on the test set. For a few of these cases display and report: 1) the input test image, 2) its correct class label, 3) the class labels and network confidence (predictive probabilities) of the top 3 network predictions (classes with highest predictive probability).
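
A possible helper for collecting such error cases is sketched below. The function name collect_errors, the class name list and the toy model and data used to exercise it are all our own assumptions; with an ImageFolder dataset the class names can be read from its classes attribute.

```python
import torch

def collect_errors(model, loader, class_names, dev, max_cases=3):
    """Return up to max_cases misclassified samples with their top-3 predictions."""
    model.eval()
    cases = []
    with torch.no_grad():
        for x, y in loader:
            probs = torch.softmax(model(x.to(dev)), dim=1).cpu()
            top_p, top_c = probs.topk(3, dim=1)
            # Indices within the batch where the top-1 prediction is wrong.
            for i in torch.nonzero(top_c[:, 0] != y).flatten().tolist():
                cases.append({
                    "image": x[i],
                    "true": class_names[y[i].item()],
                    "top3": [(class_names[c], p)
                             for c, p in zip(top_c[i].tolist(), top_p[i].tolist())],
                })
                if len(cases) == max_cases:
                    return cases
    return cases

# Stand-in data and model (assumptions) just to exercise the function.
class_names = ["dog", "elephant", "giraffe", "guitar", "horse", "house", "person"]
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 7))
toy_loader = [(torch.rand(8, 3, 8, 8), torch.arange(8) % 7)]
errors = collect_errors(toy_model, toy_loader, class_names, torch.device("cpu"))
for e in errors:
    print(e["true"], e["top3"])
```

If the inputs were standardized, the stored image should be de-normalized (multiplied by std and shifted by mean) before it is displayed.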

Part 4: Fine-Tuning the Classifier (1p)

Without changing the train-validation dataset split, reload the pretrained model:
```
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
model.to(dev)
```
Freeze all parameters of the model, so that they will not be trained, by
```
for param in model.parameters():
    param.requires_grad = False
```
Setting model.train(False) will fix the behaviour of batchnorm and dropout layers (if present) to be deterministic and input-independent.
Replace the “classifier” module with a linear layer with randomly initialized weights that outputs scores for 7 classes, as in step 2 of part 3. If we construct a linear layer now, its parameters are automatically randomly initialized and have the attribute requires_grad = True by default, i.e. they will be trainable.
Train only the classifier of the model by doing the same hyperparameter search as in the previous training (step 4 of part 3). Report the test accuracy of the best model.

Part 5: Full Fine-Tuning (1p)

Without changing the train-validation dataset split, reload the pretrained model.
Replace the “classifier” module with a linear layer, with randomly initialized weights and outputs scores for 7 classes like in step 2 of part 3.
Train the full model by doing the same hyperparameter search as in the previous training (step 4 of part 3). Report the test accuracy of the best model.
Compare the three test accuracies (training from scratch, fine-tuning only the classifier, fine-tuning the full model). What do you observe? Discuss your findings in the report.

Part 6: Few-Shot (3p)

Repeat Parts 3-5 for the PACS_cartoon_few_shot dataset. This dataset is located in /local/temporary/Datasets/PACS_cartoon_few_shot and has significantly less training data. Report your findings and discuss the difference in performance between the three trainings on PACS_cartoon and the three trainings on PACS_cartoon_few_shot.

Part 7: Data Augmentation (2pt)

In PACS_cartoon_few_shot the training data are very limited. A good practice is to use data augmentations during training. Select some transforms, which can be expected to result in a more diverse dataset. A possible set is

RandomHorizontalFlip
RandomAffine
RandomAdjustSharpness
ColorJitter

See the PyTorch transform examples. Note that transforms inherit from torch.nn.Module and can therefore be used the same way as layers, or as functions applied to data Tensors (however, not batched). They can also be built into the Dataset by setting the transform argument. They can process a PIL.Image or a Tensor. For efficiency reasons it is better to use them as functions on Tensors.

Create a composite transform with a small random effect strength (e.g. rotation up to 10 degrees, etc.) of each kind from our list or your own list.
Apply this transform only during training and not during validation or testing.
Repeat Part 5 for the PACS_cartoon_few_shot dataset only but with augmentations. Did you manage to improve your metrics compared to the non-augmented performance?
