The main task is to fine-tune a pretrained CNN for a new classification task (transfer learning).
Skills: building a data loader from an image folder, data preprocessing, loading pretrained models, using remote GPU servers, training part of a model. Insights: convolutional filters, error case analysis.
In this lab we start from a model already pretrained on the ImageNet classification dataset (1000 categories and 1.2 million images) and try to adjust it for solving a small-scale but otherwise challenging classification problem.
Now is a good time to start working with the GPU servers; check the How To page for the recommended setup.
Beware: VS Code tends to keep the connection active even after you turn off your computer. As GPU memory is expensive, log in to the server regularly and check whether your processes still occupy some GPUs. You may call

pkill -f ipykernel

to kill these processes.
State-of-the-art (SOTA) pretrained architectures are available in PyTorch. We will use the following models:
import torchvision.models

model1 = torchvision.models.vgg11(weights=torchvision.models.VGG11_Weights.DEFAULT)
model2 = torchvision.models.squeezenet1_0(weights=torchvision.models.SqueezeNet1_0_Weights.DEFAULT)

You can see the structure of the loaded model by calling print(model). You can also open the source defining the network architecture: https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py. Usually the network is defined as a hierarchy of Modules, where each Module is either an elementary layer (e.g. Conv2d, Linear, ReLU) or a container (e.g. Sequential).
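For example, the top-level structure of VGG can be inspected like this (a small sketch; the submodule names follow torchvision's implementation):

import torchvision.models

model = torchvision.models.vgg11(weights=torchvision.models.VGG11_Weights.DEFAULT)
print(model)              # the full hierarchy of Modules
print(model.features)     # convolutional part (a Sequential container)
print(model.classifier)   # fully connected part (a Sequential container)
for name, module in model.named_children():
    print(name, type(module).__name__)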
The data will be placed in /local/temporary/butterflies/ on both servers (for faster access and to avoid multiple copies).
You can also download the dataset (e.g. to use on your computer):
The dataset contains 224×224 color images of 10 butterfly categories. The scientific (Latin) names of the categories are:
01: Danaus plexippus
02: Heliconius charitonius
03: Heliconius erato
04: Junonia coenia
05: Lycaena phlaeas
06: Nymphalis antiopa
07: Papilio cresphontes
08: Pieris rapae
09: Vanessa atalanta
10: Vanessa cardui
This lab has been substantially renewed this year; please let us know about any problems you encounter with the template or the task.
The first task is simply to load the pretrained network, apply it to a test image, and visualize the convolution filters and activations in the first layer. For this task, SqueezeNet is more suitable as it has 7×7 convolution filters in the first layer. There are a couple of technicalities, prepared in the template.
The Sequential container supports slicing, so model.features[0:2] is a small neural network consisting of the first few layers.
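A minimal sketch of this first task, assuming a test image img already loaded as a (3, H, W) float Tensor in [0, 1] (the loading and plotting details are prepared in the template):

import torch
import torchvision.models
import matplotlib.pyplot as plt

model = torchvision.models.squeezenet1_0(weights=torchvision.models.SqueezeNet1_0_Weights.DEFAULT)
model.train(False)

# first-layer 7x7 convolution filters, weight shape (96, 3, 7, 7)
w = model.features[0].weight.data
w = (w - w.min()) / (w.max() - w.min())      # rescale to [0, 1] for display

fig, axes = plt.subplots(8, 12, figsize=(12, 8))
for ax, f in zip(axes.flat, w):
    ax.imshow(f.permute(1, 2, 0))            # show each filter as a 7x7 RGB image
    ax.axis('off')
plt.show()

# activations of the first convolution + ReLU on the test image
with torch.no_grad():
    act = model.features[0:2](img.unsqueeze(0))   # shape (1, 96, H', W')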
To address the classification task, we first need to load the data: create the dataset, split it into training and validation parts, and create the loaders. Fortunately, there are convenient tools for all of these steps. The respective technicalities are prepared in the template.
The training images can be loaded with datasets.ImageFolder:

import torch
from torchvision import datasets, transforms

train_data = datasets.ImageFolder('/local/temporary/butterflies/train', transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_data, batch_size=1, shuffle=True, num_workers=0)
The pretrained networks expect inputs normalized with the ImageNet statistics:

mean=[0.485, 0.456, 0.406]
std=[0.229, 0.224, 0.225]
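A sketch of the corresponding input transform (the exact preprocessing pipeline is prepared in the template):

from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])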
Split the data between the loader used for training (train_loader) and the loader used for validation (val_loader). Use the sampler argument of DataLoader with SubsetRandomSampler. Use random subsets instead of just slicing the dataset: you should not assume that the dataset is randomly shuffled (and in this task it really is not).
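One possible way to create the two loaders (a sketch; the 80/20 split ratio and the batch size are our own choices):

import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

indices = np.random.permutation(len(train_data))   # random split: the dataset is not shuffled on disk
split = int(0.8 * len(train_data))

train_loader = DataLoader(train_data, batch_size=8,
                          sampler=SubsetRandomSampler(indices[:split]))
val_loader = DataLoader(train_data, batch_size=8,
                        sampler=SubsetRandomSampler(indices[split:]))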
We will first try to learn only the last layer of the network on the new data. That is, we will use the network as a feature extractor and learn a linear classifier on top of it, as if it were a logistic regression model on some features. This task is somewhat simpler with the VGG architecture (SqueezeNet uses a fully convolutional architecture with global pooling at the end).
Move the model to the GPU with model.to(dev). Freeze all pretrained parameters:

for param in model.parameters():
    param.requires_grad = False

Calling model.train(False) will fix the behaviour of batchnorm and dropout layers (if present) to be deterministic and input-independent. The new last layer that you create to replace the original classifier will have requires_grad = True by default, i.e. it will be trainable.
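A sketch of this setup for VGG11 with 10 target classes (in torchvision's VGG the last layer is model.classifier[6]; the variable names are our own):

import torch
import torch.nn as nn
import torchvision.models

dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torchvision.models.vgg11(weights=torchvision.models.VGG11_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False              # freeze all pretrained weights
model.classifier[6] = nn.Linear(4096, 10)    # new last layer, requires_grad=True by default
model = model.to(dev)
model.train(False)                           # deterministic dropout / batchnorm behaviour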
Use the optimizer and nll_loss as proposed in the template. When loading the data, move it to the GPU as well; note that to(dev) is not an in-place operation for Tensors, unlike for Modules.
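A minimal sketch of one training epoch under these conventions (the learning rate and the choice of SGD are our own; the template proposes the exact optimizer and loss):

import torch
import torch.nn.functional as F

# only the parameters of the new last layer are trainable
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)

for x, y in train_loader:
    x, y = x.to(dev), y.to(dev)              # to(dev) returns new Tensors, it is not in-place
    scores = model(x)
    loss = F.nll_loss(F.log_softmax(scores, dim=1), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()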
Use val_loader to evaluate the validation accuracy at the end of each training epoch. Select the model that achieves the best validation accuracy over all of the learning rates and training epochs. Save the best network using torch.save. See the Saving / Loading Tutorial.
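A sketch of the accuracy evaluation and model selection (the helper function and the file name are our own):

import torch

def accuracy(model, loader):
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(dev), y.to(dev)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

best_acc = 0.0                               # initialised before the training loop
# ... at the end of each training epoch:
val_acc = accuracy(model, val_loader)
if val_acc > best_acc:
    best_acc = val_acc
    torch.save(model.state_dict(), 'best_model.pt')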
Report the final test classification accuracy of the best model (selected on the validation set). The test set is specified as a separate folder:
test_data = datasets.ImageFolder('/local/temporary/butterflies/test', transform)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=8, shuffle=False, num_workers=0)

Use the same input transform as for training. Do not re-tune the hyperparameters to achieve a better test set performance! The network will probably make a few errors on the test set. For these cases display and report: 1) the input test image, 2) its correct class label, 3) the class labels and network confidence (predictive probabilities) of the top 3 network predictions (classes with the highest predictive probability).
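A sketch of the error analysis (the class names come from test_data.classes, an attribute of ImageFolder):

import torch
import torch.nn.functional as F

model.train(False)
with torch.no_grad():
    for x, y in test_loader:
        p = F.softmax(model(x.to(dev)), dim=1)        # predictive probabilities
        conf, pred = p.topk(3, dim=1)                 # top-3 classes and their confidence
        for i in range(x.shape[0]):
            if pred[i, 0].item() != y[i].item():      # a test error
                print('true class:', test_data.classes[y[i].item()])
                print('top-3 predictions:',
                      [(test_data.classes[c], round(conf[i, j].item(), 3))
                       for j, c in enumerate(pred[i].tolist())])
                # display x[i] as an image, e.g. with matplotlib (undo the normalization first)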
Because we have very limited training / testing data available, it is a good idea to also use data augmentation. Let us select some transforms that can be expected to produce realistic images of the same class. A possible set is sketched below.
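One illustrative choice (our own suggestion, not necessarily the set proposed in the template):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])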
See Torchvision transform examples.
Note that transforms inherit from torch.nn.Module and therefore can be used in the same way as layers, or as functions applied to data Tensors (however, not batched). They can also be built into the Dataset by setting the transform argument. They can process either a PIL.Image or a Tensor. For efficiency reasons it is better to use them as functions on Tensors.
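For example (a small sketch, reusing the augment pipeline from above), applying the augmentation per image inside the training loop:

import torch

for x, y in train_loader:
    # transforms are applied per image, not per batch, then re-stacked
    x = torch.stack([augment(xi) for xi in x])
    x, y = x.to(dev), y.to(dev)
    # ... forward / backward pass as before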