Convolutional Neural Networks

In this lab, we will learn how to build and train convolutional neural networks (CNNs) using the MatConvNet framework. In particular, we will train several different CNNs to recognise handwritten digits (0-9) in 28×28 grayscale images.

We will build the network step by step, demonstrating techniques in common use today as well as several caveats one should be aware of.

Many resources for learning more about deep learning, and CNNs in particular, are available today; a few that you may find interesting are listed at the bottom of this page. However, hold off on reading them all for now. We will go slowly, adding one component after another, and when you are finally ready to do your own coding, you may want to revisit these links or find your own lectures, tutorials, videos, … We have added links to relevant explanations in the text itself, so whenever you meet a new term, you can look up its meaning.

The output of this assignment is a single file, my_cnn.mat, containing a trained CNN digit classifier, together with the scripts used for training, all packed into a single .zip file. See the template code for how to save the network.

The challenge: The network will be evaluated automatically by the upload system and ranked on the online score board. The points awarded for the assignment depend on the rank of the algorithm:

1st place: 16 points
2nd place: 14 points
3rd place: 12 points
4th place: 10 points
every submission with performance worse than the baseline: 0 points
8 points otherwise

The submission deadline is Sun Jan 7 23:59, after which the points will be assigned. Any later submission earns 6 points. You have to be better than the baseline to complete this lab.

Start by downloading the template and the training data (97MB).

MatConvNet Installation

There are several popular frameworks for training CNNs (Caffe, TensorFlow, Torch, Theano). Feel free to explore and try them after mastering this lab. For simplicity we will stay within the Matlab environment and will use the MatConvNet framework.

Download and unzip matconvnet-1.0-beta23.tar.gz. Start a Matlab session and type the following commands to enter the unzipped directory, compile the MEX files and set the necessary paths.

>> cd matconvnet-1.0-beta23
>> run matlab/vl_compilenn
>> run matlab/vl_setupnn

For Windows, the compiled .mex files are here.

Data description

We have prepared an image database MAT-file (imdb.mat) with images and corresponding labels from the MNIST dataset hosted on Yann LeCun's website. It can be loaded using the load imdb.mat command, which creates a structure imdb in Matlab with the following fields:

imdb.images.data contains the 28x28x1 images concatenated into a 4D array.
imdb.images.labels contains the corresponding labels (i.e. labels 1-10, which correspond to digits 0-9).
imdb.images.data_mean contains the mean that has been subtracted from the original images to make training easier.

Keep the last 10000 images (50001:60000) for validation and train the networks on the first 50000 images only (1:50000). This way, one can check the training for overfitting on an independent set that was not seen during training. You may reduce the training set size further if the training runs too slowly on your computer. However, the less data you use, the worse the results you can expect (this is the golden rule of deep learning).
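The split described above can be sketched as follows. To keep the snippet runnable on its own, it builds a stand-in imdb filled with zeros and random labels; the 1-by-N shape of the label vector is an assumption, so check it against the real imdb.mat. The variable names (trainIdx, valIdx, and so on) are purely illustrative.

```matlab
% Stand-in for load('imdb.mat'): same field layout, dummy content.
% ASSUMPTION: labels are stored as a 1-by-N row vector.
numImages = 60000;
imdb.images.data   = zeros(28, 28, 1, numImages, 'single');
imdb.images.labels = randi(10, 1, numImages);

% First 50000 images for training, last 10000 for validation
trainIdx = 1:50000;
valIdx   = 50001:60000;

trainData   = imdb.images.data(:, :, :, trainIdx);
trainLabels = imdb.images.labels(trainIdx);
valData     = imdb.images.data(:, :, :, valIdx);
valLabels   = imdb.images.labels(valIdx);

% Labels 1-10 encode digits 0-9
valDigits = valLabels - 1;
```

To train on less data, shorten trainIdx (for example 1:10000) and leave valIdx untouched so the validation results stay comparable.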

1. Linear classifier, softmax regression, stochastic gradient descent

We start with a simple one-layer fully-connected network to demonstrate the basic principles. In contrast to the previous algorithms, neural networks extend to the multi-class case quite naturally. The model we use here, softmax regression, is an extension of logistic regression.

Let us denote the images as $\mathbf{x}\in\mathbb{R}^{28\times 28}$ and the labels as $y\in \{0,1,\dots,9\}$. In binary logistic regression, the probability that an image $\mathbf{x}$ has label $y=1$ was modelled by $$ P(y=1|\mathbf{x},\mathbf{w}) = \frac{1}{1+\mathrm{exp}(-\mathbf{w}^\top\mathrm{vec}(\mathbf{x}))} $$ Softmax regression generalises this expression to $K$ classes $k\in\{0,\dots,K-1\}$. The probability that an image $\mathbf{x}$ has label $y=k$ is modelled as $$ P(y=k|\mathbf{x},\mathbf{w}_0,\dots,\mathbf{w}_{K-1}) = \frac{\mathrm{exp}(\mathbf{w}_k^\top\mathrm{vec}(\mathbf{x}))}{\sum_{j=0}^{K-1}\mathrm{exp}(\mathbf{w}_j^\top\mathrm{vec}(\mathbf{x}))} $$ where $\mathbf{w}_k$ is the weight vector of the output neuron for class $k$.
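As a minimal sketch, the softmax probabilities for a single image can be computed in plain Matlab as shown below. The weights W and the image x are random stand-ins rather than trained weights or real data, and subtracting the maximum score is a common numerical safeguard that does not change the result.

```matlab
% Softmax probabilities for one image and K = 10 classes.
% ASSUMPTION: W and x are random placeholders for illustration only.
rng(0);
K = 10;
x = randn(28, 28);             % one grayscale image
W = 1e-2 * randn(28*28, K);    % each column holds one class weight vector

scores = W' * x(:);            % w_k' * vec(x) for every class, K-by-1
scores = scores - max(scores); % shift for numerical stability
p = exp(scores) / sum(exp(scores));

% Matlab indices run 1..K, so index i stands for digit i-1
[~, predictedIndex] = max(p);
predictedDigit = predictedIndex - 1;
```

The vector p sums to one, and its largest entry gives the predicted class.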

As shown in the lecture slides, the go-to learning algorithm for neural networks is gradient descent with gradients computed by back-propagation. If you study the formulas carefully, you will see that they contain a sum over all training examples. As the training set grows (which we want for difficult problems), evaluating this sum for every weight update becomes prohibitively expensive. Instead, one uses stochastic gradient descent (SGD).

The basic idea of stochastic gradient descent is simple: take a small portion of the data (called a batch) and use it to estimate the gradient. Since this is only an estimate of the true gradient, updating the network weights with it is not optimal, but it generally guides the optimisation in the right direction.
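The idea can be illustrated with a small, self-contained sketch of SGD for softmax regression on synthetic data. It is not the template's cnn_train; the problem sizes, learning rate, batch size and all variable names are assumptions chosen only to keep the example short.

```matlab
% SGD for softmax regression on a small synthetic problem.
% ASSUMPTION: all sizes and hyper-parameters are illustrative.
rng(0);
N = 2000; D = 20; K = 3;
Wtrue = randn(D, K);
X = randn(D, N);                       % one example per column
[~, y] = max(Wtrue' * X, [], 1);       % labels 1..K from a hidden linear model

W = zeros(D, K);
batchSize = 100; learningRate = 0.1; numEpochs = 20;

for epoch = 1:numEpochs
    order = randperm(N);               % visit the data in random order
    for first = 1:batchSize:N
        idx = order(first:min(first + batchSize - 1, N));
        B = numel(idx);
        Xb = X(:, idx);

        % Forward pass: class probabilities for the batch (K-by-B)
        scores = W' * Xb;
        scores = bsxfun(@minus, scores, max(scores, [], 1));
        E = exp(scores);
        P = bsxfun(@rdivide, E, sum(E, 1));

        % One-hot encoding of the batch labels (K-by-B)
        Y = zeros(K, B);
        Y(sub2ind(size(Y), y(idx), 1:B)) = 1;

        % Gradient of the mean cross-entropy loss, estimated on the batch
        grad = Xb * (P - Y)' / B;      % D-by-K
        W = W - learningRate * grad;
    end
end

[~, pred] = max(W' * X, [], 1);
trainError = mean(pred ~= y);
```

Each update touches only batchSize examples, so its cost does not depend on the size of the whole training set.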

Task:

Prove that softmax regression is a generalisation of logistic regression, i.e. show that for a binary classification problem ($K=2$) the softmax model is equivalent to the logistic model. (Only when desperate, have a look at the solution.)

The template contains code for training a simple one-layer network with softmax regression on the output, trained using stochastic gradient descent. MatConvNet stores the layers of a neural network in a structure array. For example, a convolutional layer, which computes the convolution of the input image (input_rows x input_cols x input_channels) with N convolutional kernels (N=output_channels), is initialised as follows:

>> net.layers{1} = struct('type', 'conv', ...
            'weights', {{1e-2*randn(input_rows,input_cols,input_channels, output_channels,'single'), randn(1,output_channels,'single')}}, ...
            'stride', 1, ...
            'pad', 0) ;

The second randn in the weights initialisation is for the bias term ($y=wx+b$). In our case, we want to build a neural network with a single fully-connected layer. MatConvNet implements this as a convolutional layer whose kernel covers the whole 28x28x1 input, so that it acts as a fully-connected layer, with 10 outputs corresponding to the probabilities $P(y=k|\mathbf{x})$.

In order to train a CNN, a loss function must be defined. MatConvNet implements the loss function as a special layer (e.g. see the softmax layer).
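Following the pattern of the snippet above, a one-layer network with a loss layer might be assembled as sketched below. The 'softmaxloss' layer type, which combines the softmax with the log-loss, is used here as one possible choice; the template may define the loss differently. The bias is initialised to zeros here, unlike the snippet above, which draws it with randn.

```matlab
% One fully-connected layer (28x28x1 input, 10 outputs) plus a loss layer.
% ASSUMPTION: zero bias and the 'softmaxloss' type are illustrative choices.
input_rows = 28; input_cols = 28; input_channels = 1; output_channels = 10;

net.layers = {};
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{1e-2*randn(input_rows, input_cols, input_channels, output_channels, 'single'), ...
                 zeros(1, output_channels, 'single')}}, ...
    'stride', 1, ...
    'pad', 0);
net.layers{end+1} = struct('type', 'softmaxloss');

% Kernel covers the whole image, so the spatial output size is 1x1
kernelSize = size(net.layers{1}.weights{1});
```

Because the kernel is as large as the input, the layer produces a single 1x1x10 response per image, which is exactly what a fully-connected layer with 10 outputs would compute.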

Tasks:

Train the network on the MNIST data.

>> [net, info] = cnn_train(net, imdb, @getSimpleNNBatch, 'batchSize', 1000, 'numEpochs', 100, 'expDir', 'expDir');

It is always a good idea to train another standard classifier (e.g. SVM) on the same data to get some baseline results. If the network does not perform better, something is wrong. Try to train an SVM classifier with the same data.
Add another fully-connected layer with 1000 hidden units and a sigmoid non-linearity. Do you get better results?
Try to add more layers. Does the network improve with more and more layers?
How many weights do you learn in each case?
Try to substitute the sigmoid non-linearity with Rectified Linear Unit (ReLU). It helps to avoid the vanishing gradient problem.
Experiment with the number of layers, number of hidden units and try to get the best possible result.
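For the tasks above, a network with one hidden layer could be laid out as in the following sketch, which also counts the learned weights by walking over the layers. The layer types 'sigmoid' and 'relu' are used as in MatConvNet's simple network format; the zero bias initialisation and the variable names are assumptions made for illustration.

```matlab
% Two fully-connected layers with a non-linearity in between.
% ASSUMPTION: initialisation scale and zero biases are illustrative.
hidden = 1000;

net.layers = {};
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{1e-2*randn(28, 28, 1, hidden, 'single'), zeros(1, hidden, 'single')}}, ...
    'stride', 1, 'pad', 0);
net.layers{end+1} = struct('type', 'sigmoid');   % replace with 'relu' to try ReLU
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{1e-2*randn(1, 1, hidden, 10, 'single'), zeros(1, 10, 'single')}}, ...
    'stride', 1, 'pad', 0);
net.layers{end+1} = struct('type', 'softmaxloss');

% Count all learned parameters (kernels and biases)
numWeights = 0;
for i = 1:numel(net.layers)
    if isfield(net.layers{i}, 'weights')
        numWeights = numWeights + sum(cellfun(@numel, net.layers{i}.weights));
    end
end
```

The second layer uses 1x1 kernels because the first layer has already reduced each image to a 1x1xhidden response.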

2. Convolutional Neural Networks

One of the main disadvantages of using fully-connected layers on images is that they do not take the spatial structure of the image into account. Imagine that you apply one fixed random permutation to the pixels of every image (in both the training and the test data) and re-train the fully-connected network. The permuted images become completely unrecognisable for humans, since humans make a prior assumption about the spatial arrangement. Nevertheless, the expected test error of the re-trained network on this permuted dataset will be the same, since the network makes no such assumption and learns the spatial arrangement from scratch from the permuted training data. When we learn on images, the architecture of the network should reflect their particular spatial arrangement.
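The following sketch illustrates why a fully-connected layer does not care about pixel order: permuting the pixels and permuting the rows of the weight matrix in the same way yields identical outputs, so for every network on the original data there is an equally good one on the permuted data. The image and the weights are random placeholders.

```matlab
% A fully-connected layer is indifferent to a fixed pixel permutation.
% ASSUMPTION: x and W are random placeholders, not real data or weights.
rng(0);
x = randn(28, 28);
W = randn(28*28, 10);

perm = randperm(28*28);      % one fixed permutation for all images
xv = x(:);
xPermuted = xv(perm);
WPermuted = W(perm, :);      % weights rearranged in the same way

scoresOriginal = W' * xv;
scoresPermuted = WPermuted' * xPermuted;
maxDiff = max(abs(scoresOriginal - scoresPermuted));   % differs only by rounding
```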

We impose the spatial arrangement by introducing convolutional layers. Convolution works by shifting a local template (often called a convolution kernel or a local receptive field) over the image and computing its response at every position in the image. For example, when the input image is 28×28 and we compute the convolution with a 5×5 kernel, then the resulting response image will be 24×24 (unless we pad the image with zeros). When learned, these templates often correspond to edge or corner detectors.
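The size arithmetic can be checked with Matlab's built-in conv2, as in the sketch below. The image and kernel are random placeholders, and the helper outSize is a hypothetical name introduced here for the usual output-size rule with padding and stride.

```matlab
% Response size of a 5x5 kernel on a 28x28 image, without and with padding.
img    = rand(28, 28);   % placeholder for one input image
kernel = rand(5, 5);     % placeholder for one learned template

response = conv2(img, kernel, 'valid');           % no padding

padded = zeros(32, 32);                           % 2 zero pixels on every side
padded(3:30, 3:30) = img;
responsePadded = conv2(padded, kernel, 'valid');  % same size as the input

% Output size along one dimension (hypothetical helper)
outSize = @(in, k, pad, stride) floor((in + 2*pad - k) / stride) + 1;
sizeNoPad = outSize(28, 5, 0, 1);
sizePad2  = outSize(28, 5, 2, 1);
```

Without padding the response shrinks by kernel size minus one in each dimension; padding by two pixels per side restores the original 28×28 size for a 5×5 kernel.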

Another disadvantage of fully-connected layers is that the number of parameters grows quickly with every new layer. This means that significantly more parameters need to be learned, and thus more data is needed to avoid overfitting.
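A rough count makes the difference concrete. The sketch compares a fully-connected layer with 1000 hidden units on a 28×28 image against a convolutional layer; the choice of 20 kernels of size 5×5 is an assumption made only for this comparison.

```matlab
% Parameter counts (weights plus biases) for one layer on a 28x28x1 input.
% ASSUMPTION: 20 kernels of size 5x5 is an arbitrary illustrative choice.
hiddenUnits = 1000;
fcParams = 28*28*1*hiddenUnits + hiddenUnits;     % every unit sees every pixel

numKernels = 20;
convParams = 5*5*1*numKernels + numKernels;       % kernels are shared across positions

ratio = fcParams / convParams;
```

The convolutional layer reuses the same small kernels at every image position, which is why its parameter count does not depend on the image size.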