Lab 3: CNN, Pytorch Workflow

CNN MNIST, Pytorch project workflow, training/validation/test set, hyperparameters, single loop learning rate selection; evaluation: accuracy, ranking, t-SNE embedding.

The basics are similar to the PyTorch tutorial.

Initial Template

We suggest the following steps to learn pytorch:

We will use MNIST classification dataset.

The provided template.zip includes the basic data loading code.

 
import torch
import torchvision
from torchvision import transforms

class Data():
    def __init__(self, args):
        # transforms: convert to tensor and normalize pixel values to [-1, 1]
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
        self.train_set = torchvision.datasets.MNIST('../data', download=True, train=True, transform=transform)
        self.test_set = torchvision.datasets.MNIST('../data', download=True, train=False, transform=transform)
        # dataloaders
        self.train_loader = torch.utils.data.DataLoader(self.train_set, batch_size=args.batch_size, shuffle=True, num_workers=0)
        self.test_loader = torch.utils.data.DataLoader(self.test_set, batch_size=args.batch_size, shuffle=True, num_workers=0)
        # Task: split train_set into train_set and val_set and create loaders

The Dataset class is responsible for knowing how to access the data, and DataLoader is responsible for shuffling, (parallel) loading, and batching. Writing your own Dataset class is rather simple, e.g.

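import torch

# device on which the tensors are kept (assumed definition; `dev` is not defined in this snippet)
dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class DataXY(torch.utils.data.Dataset):
    def __init__(self, X: torch.Tensor, Y: torch.Tensor, transform=None):
        # keep the whole dataset on the device
        self.X = X.to(dev)
        self.Y = Y.to(dev)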
        self.transform = transform
 
    def __getitem__(self, index):
        x, y = self.X[index], self.Y[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, y
 
    def __len__(self):
        return self.X.size(0)

Define a neural network model using the higher level building blocks: torch.nn.Sequential, torch.nn.Linear, torch.nn.Conv2d, nn.ReLU. Start with a simple model for ease of debugging. Choose an optimizer, for example torch.optim.SGD. Create the training loop.
The basic variant, together with a simple network, loss and optimizer, is already implemented in the template. The network and the loss look like this
```
# network, expect input images 28* 28 and 10 classes
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss = nn.CrossEntropyLoss(reduction='none')
```
Inspect the code and look up the classes used in the PyTorch docs. Why is the network so simple, and where is the softmax?
Refine the net architecture, using convolutions, max-pooling and a fully connected layer at the end. Report your architecture by calling print(net). Note that this is a very small dataset with small input images. Do not use a full-blown architecture such as VGG that we considered in the lecture. Invent a small convolutional architecture of your own or get inspiration from the famous LeNet5 model.
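As an illustration of what "small" can mean here, the following is a minimal LeNet-like sketch. The channel counts, kernel sizes and the hidden layer width are arbitrary assumptions, not a prescribed solution; the shape comments assume 1x28x28 inputs.

```python
import torch
from torch import nn

# Hypothetical small CNN for 1x28x28 inputs; all layer sizes are illustrative.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, padding=2),  # -> 8 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(2),                            # -> 8 x 14 x 14
    nn.Conv2d(8, 16, kernel_size=5),            # -> 16 x 10 x 10
    nn.ReLU(),
    nn.MaxPool2d(2),                            # -> 16 x 5 x 5
    nn.Flatten(),                               # -> 400
    nn.Linear(16 * 5 * 5, 64),
    nn.ReLU(),
    nn.Linear(64, 10),                          # class scores (logits)
)

x = torch.zeros(4, 1, 28, 28)  # dummy batch of 4 images
scores = net(x)
print(net)
print(scores.shape)
```

Writing the expected shape next to each layer makes it easier to find the input size of the first fully connected layer.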

Part 1. Full Workflow (6p)

Extend the template with the following:

Split the training set into training and validation sets by creating two loaders that use different disjoint portions of the initial training set. This can be done using torch.utils.data.SubsetRandomSampler to split the dataset and then passing the subsets to two loaders. Use 90% of the initial training set for training (gradient-based parameter optimization) and use 10% of the initial training set for validation. When you do the splitting do not assume that the training set is already shuffled: a common novice mistake is to withhold the last 10000 images.
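One possible way to do the split is sketched below. To keep the example runnable without downloading MNIST, a small TensorDataset of zeros stands in for the training set; the names full_train_set, train_idx and val_idx and the fixed seed are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Stand-in for the MNIST training set (assumption, avoids a download).
full_train_set = TensorDataset(
    torch.zeros(1000, 1, 28, 28), torch.zeros(1000, dtype=torch.long)
)

n = len(full_train_set)
# Random permutation of all indices, so we do not rely on the stored order.
gen = torch.Generator().manual_seed(0)
perm = torch.randperm(n, generator=gen).tolist()

n_val = n // 10                       # 10% for validation
val_idx, train_idx = perm[:n_val], perm[n_val:]

# Both loaders read the same dataset but sample from disjoint index sets.
# Note: `shuffle` must not be set when a sampler is given.
train_loader = DataLoader(full_train_set, batch_size=64,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(full_train_set, batch_size=64,
                        sampler=SubsetRandomSampler(val_idx))
```

SubsetRandomSampler reshuffles its indices each time the loader is iterated, so the training batches still differ between epochs.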
Implement the learning rate “range test” described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Figure 2a). The model is trained for one epoch while the learning rate changes after each batch. Start with a small learning rate of $10^{-5}$ and multiply it by a constant factor after each batch so that the learning rate reaches $0.1$ by the end of the epoch. Monitor the stochastic training loss for each batch and plot a graph similar to Figure 2a. Visually determine the learning rate at which the graph shows the minimum loss, and select half of that value as the learning rate for training.
Create a history dictionary with numpy arrays train_loss_batch, train_acc_batch and log there the training loss and training accuracy for each batch. We will save these for further processing at visualization time. PyTorch tensors can be converted to numpy arrays using x.detach().cpu().numpy().
At the end of each epoch, print the training loss and accuracy computed by summing the losses and errors from all batches as measured during training and dividing by the total number of training samples. Since we draw data without replacement, this should be the average loss and accuracy on the training set (although with a hysteresis effect, because the parameters were updated at each batch).
At the end of each epoch, compute the validation loss and validation accuracy using the validation loader (no optimization steps). Save these values to the val_loss, val_acc numpy arrays in the history dict. Unlike training metrics, validation metrics are measured and recorded once per epoch. Save the history dict using pickle.dump at the end of each epoch. The file name can be a string containing all arguments passed to the program, like “--lr 0.01 --optimizer Adam”. This allows you to visualize the current learning progress and compare several runs.
Also save your current model at the end of each epoch. Saving after each epoch will allow you to plot partial results.
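The per-batch learning rates of the range test form a geometric sequence. A minimal sketch of computing them is shown below; the number of batches is a hypothetical value, and the way the rate is applied to the optimizer is only indicated in a comment.

```python
lr_min, lr_max = 1e-5, 0.1
n_batches = 844  # hypothetical: e.g. 54000 training samples, batch size 64

# Constant factor such that lr_min * gamma**(n_batches - 1) == lr_max.
gamma = (lr_max / lr_min) ** (1.0 / (n_batches - 1))
lrs = [lr_min * gamma ** k for k in range(n_batches)]

# In the training loop, before processing batch k, one could set
#     for group in optimizer.param_groups:
#         group['lr'] = lrs[k]
# and record the batch loss together with lrs[k] for the plot.
print(gamma, lrs[0], lrs[-1])
```

Since the rates span several orders of magnitude, a logarithmic x axis is the natural choice for the loss plot.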
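The bookkeeping described above can be sketched as follows. The helper log, the temporary directory and the logged numbers are placeholders for illustration only; in the real script the values come from the training loop and the run name from the parsed arguments.

```python
import os
import pickle
import tempfile

import numpy as np

history = {
    "train_loss_batch": np.array([]),
    "train_acc_batch": np.array([]),
    "val_loss": np.array([]),
    "val_acc": np.array([]),
}

def log(history, key, value):
    """Append one scalar to the numpy array stored under `key` (hypothetical helper)."""
    history[key] = np.append(history[key], value)

# Placeholder values, one entry per batch / per epoch.
log(history, "train_loss_batch", 2.3)
log(history, "train_loss_batch", 2.1)
log(history, "train_acc_batch", 0.10)
log(history, "train_acc_batch", 0.25)
log(history, "val_loss", 2.0)
log(history, "val_acc", 0.30)

# File name built from the program arguments; saved at the end of each epoch.
run_name = "--lr 0.01 --optimizer Adam"
path = os.path.join(tempfile.gettempdir(), run_name + ".pkl")
with open(path, "wb") as f:
    pickle.dump(history, f)

# In the notebook, the history is loaded back like this.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```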
We will visualize the training progress in the view.ipynb notebook. Load there your saved history and make the following inline plots using matplotlib:
- Plot on the same axes: i) the training loss as recorded per batch, ii) the exponentially weighted average (EWA) of the training loss per batch and iii) the validation loss recorded per epoch. For the training losses, the x coordinates need to be rescaled so that they are displayed in epoch units. Use a logarithmic axis for the losses. Empirically find a value of the smoothing parameter $q$ that reflects the trend without too much hysteresis.
- Make a similar plot for the training accuracy: raw training accuracy, EWA of training accuracy and the validation accuracy.
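A minimal sketch of the smoothing and of the x-axis rescaling is given below. It assumes the recursion $s_t = q s_{t-1} + (1-q) x_t$ with the usual correction by $1 - q^t$; the exact definition of $q$ is an assumption here, and the loss values are synthetic.

```python
import numpy as np

def ewa(x, q=0.9):
    """EWA with s_t = q*s_{t-1} + (1-q)*x_t, divided by (1 - q**t) to remove startup bias."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    s = 0.0
    for t, v in enumerate(x, start=1):
        s = q * s + (1.0 - q) * v
        out[t - 1] = s / (1.0 - q ** t)
    return out

# Synthetic per-batch losses standing in for history["train_loss_batch"].
n_epochs, batches_per_epoch = 3, 100
rng = np.random.default_rng(0)
n = n_epochs * batches_per_epoch
train_loss_batch = np.exp(-np.linspace(0, 3, n)) + 0.05 * rng.random(n)

smoothed = ewa(train_loss_batch, q=0.95)
x_batch = np.arange(1, n + 1) / batches_per_epoch  # batch index in epoch units
x_epoch = np.arange(1, n_epochs + 1)               # validation points, once per epoch

# With matplotlib one could then call, e.g.:
#     plt.semilogy(x_batch, train_loss_batch, label="train loss (batch)")
#     plt.semilogy(x_batch, smoothed, label="train loss (EWA)")
```

A larger $q$ gives a smoother curve that lags further behind the raw values.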
Train the model for 30 epochs. This dataset is really easy; it shouldn't take you long to get accuracy above 98%. Also try a learning rate half as large and compare the results.
Report the plot of the range test and the training and validation plots described above. Please label the axes and add a legend.

Part 2. Additional Analysis (4p)

Extend your visualization notebook as follows. Load the model saved at epoch 30.

Compute and report its test accuracy using the test dataset.
Print the class confusion matrix. It is a matrix of size $C \times C$ for $C$ classes. Its $ij$ entry contains the number of data samples whose true class was $j$ and whose predicted class was $i$. Determine the pair of digits in the data that is the most confusing for your classifier.
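The confusion matrix with this index convention can be sketched with numpy as follows. The labels are made up, and reading "most confusing pair" as the unordered pair with the most mutual misclassifications is one possible interpretation, not the only one.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry [i, j] counts samples with true class j and predicted class i."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_pred, y_true), 1)  # accumulates correctly for repeated index pairs
    return cm

# Made-up labels for three classes.
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1])
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)

# Off-diagonal entries are the errors; adding the transpose counts both directions.
off = cm - np.diag(np.diag(cm))
sym = off + off.T
pair = np.unravel_index(np.argmax(sym), sym.shape)
print("most confused pair:", pair)
```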

tSNE

Using the loaded model and the test data, compute the network features before the last linear layer (we assume you have more than one layer). To remove the last layer, try slicing the Sequential container or deleting the last layer in it.
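Slicing the container can look like the sketch below. The tiny untrained network is only a stand-in for your loaded model, and the dummy input replaces the test images.

```python
import torch
from torch import nn

# Stand-in for the loaded model (assumption); the last layer is the classifier.
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)

# Slicing an nn.Sequential returns a new Sequential sharing the same layers.
feature_net = net[:-1]
feature_net.eval()

with torch.no_grad():
    x = torch.zeros(5, 1, 28, 28)   # dummy batch instead of test images
    features = feature_net(x)       # activations before the last linear layer

features_np = features.cpu().numpy()  # numpy array, ready for an embedding method
print(features_np.shape)
```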
Using sklearn.manifold.TSNE (see also example) visualize the embedding of feature vectors. If they form well-separated clusters, the network has found a good representation of the data. In this representation any classical algorithm (e.g. nearest neighbor classifier) can be applied.

- Report tSNE plot

Does the classifier rank well?

Often (especially when trained long enough), deep models have good classification accuracy but their predictive probabilities $p(y|x)$ are not well calibrated (typically overconfident), see On Calibration of Modern Neural Networks. Yet, more confident predictions tend to be more accurate. If this is the case, we will say that the classifier ranks well.

Consider that we want to use the classifier with the “reject from recognition” option. If the classifier picks the class $\hat y_i = {\rm argmax}_y p(y|x_i; \theta)$ on test point $i$, let us call $c_i = p(\hat y_i|x_i; \theta)$ its confidence. It is desirable to reject from recognition when the classifier is not sufficiently confident, i.e. when $c_i \leq \alpha$, where $\alpha$ is a confidence threshold. We will study the dependence of the performance on the confidence threshold. Note that ordering by confidences is not the same as ordering by scores (why?).

a) Plot the absolute number of errors as a function of the threshold value $\alpha$. Since we work with a finite sample of test data, the test error rate will only change when $\alpha$ crosses one of the $c_i$ values we have. So instead of taking very small steps on the threshold and recomputing the error rate anew each time, here is a better way to do it. Sort all the confidences $c_i$ in ascending order. Let $e_i = 1$ if $\hat y_i \neq y^*_i$, i.e. we make an error, and $e_i = 0$ if $\hat y_i = y^*_i$. If $c_{(i)}$ is the sorted sequence of confidences with error indicators $e_{(i)}$, then the number of errors among the accepted points with threshold $\alpha = c_{(i)}$ is the sum of the values $e_{(i+1)},\dots, e_{(n)}$. You can compute this sum as

 np.sum(e)-np.cumsum(e)

Set the range of the threshold from the minimum to maximum $c_i$. This plot is an intermediate result.
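On a toy example with made-up confidences and error indicators, the computation looks like this (a sketch; in the assignment $c$ and $e$ come from the model predictions on the test set):

```python
import numpy as np

# Made-up confidences and error indicators for five test points.
c = np.array([0.90, 0.55, 0.99, 0.60, 0.70])
e = np.array([0, 1, 0, 1, 0])

# Sort by confidence in ascending order and reorder the errors accordingly.
order = np.argsort(c)
c_sorted, e_sorted = c[order], e[order]

# Entry i: errors among points with confidence strictly above c_sorted[i],
# i.e. the points still accepted when the threshold equals c_sorted[i].
errors_accepted = np.sum(e_sorted) - np.cumsum(e_sorted)
print(c_sorted)
print(errors_accepted)
```

Plotting errors_accepted against c_sorted gives the curve for a).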

b) Plot the relative error rate of accepted points (the number of errors divided by the number of points accepted for recognition) versus the number of points rejected from recognition when the threshold is varied. If the relative error declines, the classifier is ranking well (we are rejecting erroneous points and keeping correct ones). Report the plots a) and b).
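A possible sketch for b) is given below, again on made-up data. It indexes the curve by the number $k$ of rejected points (the $k$ least confident ones), including $k = 0$ where nothing is rejected; this parametrization is an assumption about how to set up the plot.

```python
import numpy as np

# Made-up confidences and error indicators for five test points.
c = np.array([0.90, 0.55, 0.99, 0.60, 0.70])
e = np.array([0, 1, 0, 1, 0])

order = np.argsort(c)          # ascending confidence
e_sorted = e[order]
n = len(e_sorted)

# k = number of rejected points (the k lowest confidences), k = 0 .. n-1.
n_rejected = np.arange(n)
n_accepted = n - n_rejected

# Errors remaining among accepted points after rejecting the k lowest confidences.
cum = np.concatenate(([0], np.cumsum(e_sorted)[:-1]))
errors_accepted = np.sum(e_sorted) - cum

error_rate = errors_accepted / n_accepted
print(n_rejected)
print(error_rate)
```

In this toy example both errors have the lowest confidences, so the error rate drops to zero after two rejections, which is what a well-ranking classifier looks like.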
