PyTorch, project pipeline, training/validation/test sets, model selection (architecture), overfitting, early stopping, CNN on MNIST, visualizing ranking, t-SNE embedding.
What is PyTorch: a Python front end, C++ core libraries (ATen), and target-device libraries (e.g. cuDNN). These will be useful resources for this lab:
We suggest the following steps to learn PyTorch:
numpy.array
```python
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# transforms
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# datasets
trainset = torchvision.datasets.MNIST('./data', download=True, train=True, transform=transform)

# dataloaders
train_loader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=0)

# let's verify how the loader packs the data
(data, target) = next(iter(train_loader))
# should be [batch_size x 1 x 28 x 28]
print('Input size:', data.size())
# should be [batch_size]
print('Labels size:', target.size())
# see the number of training data points
n_train_data = len(trainset)
print('Train data size:', n_train_data)
```
```python
# network, expects input images of 28 * 28 pixels and 10 classes
net = nn.Sequential(nn.Linear(28 * 28, 10))
# loss function
loss = nn.CrossEntropyLoss(reduction='none')
# optimizer
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(10):
    # will accumulate the total loss over the dataset
    L = 0
    # loop fetching a mini-batch of data at each iteration
    for i, (data, target) in enumerate(train_loader):
        # flatten the data to size [batch_size x 784]
        data_vectors = data.flatten(start_dim=1)
        # apply the network
        y = net(data_vectors)
        # calculate per-sample mini-batch losses
        l = loss(y, target)
        # accumulate the total loss as a regular float number (important to stop graph tracking)
        L += l.sum().item()
        # gradients accumulate by default, so clear them explicitly
        optimizer.zero_grad()
        # compute the gradient from the mini-batch loss
        l.mean().backward()
        # make the optimization step
        optimizer.step()
    print(f'Epoch: {epoch} mean loss: {L / n_train_data}')
```
print(net)
Extend the code above with the following:
- `history`
- `train_loss`
- `train_acc`
- `val_loss`
- `val_acc`
- `val_loss[:,i]`
- `pickle.dump`
torch.save(net.state_dict(), PATH)
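A minimal sketch of what the statistics logging and saving could look like; the exact structure of `history`, the placeholder values, and the file name are assumptions for illustration, not part of the assignment:

```python
import pickle

# hypothetical container for per-epoch statistics (key names taken from the list above)
history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

# inside the epoch loop you would append each epoch's statistics, e.g.
for epoch in range(3):
    history['train_loss'].append(1.0 / (epoch + 1))   # placeholder values
    history['train_acc'].append(0.80 + 0.05 * epoch)
    history['val_loss'].append(1.2 / (epoch + 1))
    history['val_acc'].append(0.75 + 0.05 * epoch)

# save the history with pickle.dump so that plots can be redone later;
# the model itself would be saved separately with torch.save(net.state_dict(), PATH)
with open('history.pkl', 'wb') as f:
    pickle.dump(history, f)

# reload to verify the round trip
with open('history.pkl', 'rb') as f:
    loaded = pickle.load(f)
print(loaded['train_acc'])
```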
Please annotate the axes and add a legend.
```python
dev = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
...
net.to(dev)
...
data = data.to(dev)
```
Note the different syntax for moving a Tensor and a Model to a device: `net.to(dev)` moves the model's parameters in place, while `data.to(dev)` returns a new Tensor that must be assigned back.
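A short sketch illustrating the difference; the tiny `Linear` module and the zero tensor are just stand-ins:

```python
import torch

dev = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# a Model: .to(dev) moves the parameters in place (the return value is the same module)
net = torch.nn.Linear(4, 2)
net.to(dev)

# a Tensor: .to(dev) returns a NEW tensor, so the result must be assigned back
data = torch.zeros(3, 4)
data = data.to(dev)

print(next(net.parameters()).device, data.device)
```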
Extend your visualization notebook as follows. Load the model at epoch 100.
```python
testset = torchvision.datasets.MNIST('./data', download=True, train=False, transform=transform)
test_loader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=0)
```
For calculations in this and the next assignments, use numpy. PyTorch tensors can be converted to numpy arrays using `x.cpu().numpy()`.
- Report the t-SNE plot
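A hedged sketch of producing the t-SNE plot with scikit-learn; here random data stands in for the network outputs collected on the test set, and the figure file name is an assumption:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use('Agg')  # allow plotting without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# stand-in for the test-set network outputs: [n_points x n_features]
X = rng.normal(size=(200, 10))
labels = rng.integers(0, 10, size=200)

# embed into 2D with t-SNE
emb = TSNE(n_components=2, perplexity=30, init='random', random_state=0).fit_transform(X)

plt.figure(figsize=(6, 6))
sc = plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=10)
plt.colorbar(sc, label='class')
plt.xlabel('t-SNE dim 1')
plt.ylabel('t-SNE dim 2')
plt.title('t-SNE embedding of test outputs')
plt.savefig('tsne.png')
print(emb.shape)
```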
Consider that we are allowed to use a "reject from recognition" option. If the classifier picks the class $\hat y_i = {\rm argmax}_y \, p(y|x_i; \theta)$ on test point $i$, let us call $c_i = p(\hat y_i|x_i; \theta)$ its confidence. We will want to reject from recognition when we are not confident, i.e. when $c_i \leq \alpha$, where $\alpha$ is a confidence threshold. We will not fix this threshold, but study the performance for all possible thresholds.
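As a sketch with made-up logits for 4 test points and 3 classes, the predictions $\hat y_i$ and confidences $c_i$ can be computed as:

```python
import numpy as np

# hypothetical network scores (logits), one row per test point
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 0.0],
                   [0.0, 3.0, 0.5],
                   [1.0, 1.1, 0.9]])

# softmax over classes gives p(y | x_i; theta)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# predicted class and its confidence c_i = p(y_hat_i | x_i; theta)
y_hat = p.argmax(axis=1)
c = p.max(axis=1)
print(y_hat)  # → [0 0 1 1]
print(np.round(c, 3))
```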
a) Plot the number of errors as a function of the threshold value $\alpha$. Since we work with a finite sample of test data, the test error rate will only change when $\alpha$ crosses one of the $c_i$ values we have. So instead of taking very small steps on the threshold and recomputing the error rate anew each time, here is a better way to do it. Sort all the confidences $c_i$ in ascending order. Let $e_i = 1$ if $\hat y_i \neq y^*_i$, i.e. we make an error, and $e_i = 0$ if $\hat y_i = y^*_i$. If $c_{(i)}$ is the sorted sequence of confidences with error indicators $e_{(i)}$, then we can compute the number of errors among accepted points at threshold $\alpha = c_{(i)}$ as the sum of the values $e_{(i+1)}, \dots, e_{(n)}$. You can compute this sum as
`np.sum(e) - np.cumsum(e)`
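A small worked example of this trick, with hypothetical confidences and error indicators:

```python
import numpy as np

# hypothetical confidences and error indicators for 6 test points
c = np.array([0.9, 0.4, 0.7, 0.2, 0.95, 0.6])
e = np.array([0,   1,   0,   1,   0,    1])  # 1 = misclassified

# sort by confidence (ascending) and permute the error indicators accordingly
order = np.argsort(c)
c_sorted = c[order]
e_sorted = e[order]

# number of errors among points with confidence above c_(i), for each i
errors_above = np.sum(e_sorted) - np.cumsum(e_sorted)
print(c_sorted)       # → [0.2  0.4  0.6  0.7  0.9  0.95]
print(errors_above)   # → [2 1 0 0 0 0]
```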
b) Plot the number of points rejected from recognition as a function of the threshold value $\alpha$. For this we just need to plot the values $1$ to $n$ versus the sorted array $c_{(i)}$.
c) Plot the error rate of accepted points (number of errors versus number of points accepted for recognition). This just combines the data from a) and b). If the relative error declines, the classifier is ranking well (we are rejecting erroneous points and keeping correct ones).
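A sketch of assembling the three quantities on hypothetical sorted error indicators; the guard against dividing by zero accepted points is an added assumption:

```python
import numpy as np

# hypothetical error indicators, already sorted by ascending confidence
e_sorted = np.array([1, 1, 1, 0, 0, 0])
n = len(e_sorted)

# a) number of errors among accepted points, as the threshold sweeps over c_(i)
errors = np.sum(e_sorted) - np.cumsum(e_sorted)
# b) number of rejected points: at threshold c_(i), exactly i points are rejected
rejected = np.arange(1, n + 1)
accepted = n - rejected
# c) error rate of accepted points (guarding the division when everything is rejected)
error_rate = np.divide(errors, accepted, out=np.zeros(n, float), where=accepted > 0)
print(error_rate)  # → [0.4  0.25 0.   0.   0.   0.  ]
```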
Report plots a), b), c)