In this lab, we will learn how to build and train Convolutional Neural Networks (CNN) using the MatConvNet framework. In particular, we will train several different CNNs for the recognition of handwritten digits (0-9) from 28×28 grayscale images.
We will build the network step-by-step demonstrating various techniques used nowadays as well as several caveats one should be aware of.
There are many resources available today for learning more about deep learning and CNNs in particular. A few that you may find interesting are listed at the bottom of this page. However, hold off on reading them all at first. We will go slowly, adding one component after another, and once you are ready to do your own coding, you may want to revisit these links or find your own lectures, tutorials, videos, … We added links to relevant explanations into the text itself, so whenever you meet a new term, you can look up its meaning.
The output of this assignment will be a single file my_cnn.mat containing a trained CNN digit classifier, packed together with the scripts used for training into a single .zip file. See the template code for how to save the network.
The challenge: The network will be automatically evaluated by the upload system and ranked at the online score board. The gained points from the assignment depend on the rank of the algorithm:
The submission deadline is Sun Jan 7 23:59, after which the points will be assigned. Any later submission earns 6 points. You have to beat the baseline to complete this lab.
Start by downloading the template and the training data (97MB).
There are several popular frameworks for training CNNs (Caffe, TensorFlow, Torch, Theano). Feel free to explore and try them after mastering this lab. For simplicity we will stay within the Matlab environment and will use the MatConvNet framework.
Download and unzip matconvnet-1.0-beta23.tar.gz. Start a Matlab session and type the following commands in order to go into the unzipped directory, compile MEX-files and set necessary paths.
>> cd matconvnet-1.0-beta23
>> run matlab/vl_compilenn
>> run matlab/vl_setupnn
For Windows, the compiled .mex files are here.
We have prepared an image database MAT-file (imdb.mat) with images and corresponding labels from the MNIST dataset hosted on Yann LeCun's website. It can be loaded using the load imdb.mat command, which creates a structure imdb in Matlab with the following fields:
imdb.images.data
imdb.images.labels
imdb.images.data_mean
Keep the last 10000 images (50001:60000) for validation and train the networks on the first 50000 images only (1:50000). This way one can check the training for overfitting on an independent set that was not seen during training. You may reduce the training set size further if the training runs too slowly on your computer. However, the less data you use, the worse the results you can expect (this is the golden rule of deep learning).
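The split described above can be written down as index sets. The following is a minimal sketch, assuming the convention used in MatConvNet examples of a vector that marks each image as training (1) or validation (2). Whether the lab template and its cnn_train expect such a field is an assumption, so check the template code first. The names setId and numTrain are hypothetical.

```matlab
% Sketch: mark which of the 60000 images are used for training/validation.
numImages = 60000;
trainIdx = 1:50000;          % first 50000 images: training
valIdx   = 50001:60000;      % last 10000 images: validation

% Optional: use fewer training images if training is too slow.
numTrain = 20000;            % hypothetical value, tune for your machine
trainIdx = trainIdx(1:numTrain);

% Assumed convention: 1 = training, 2 = validation, 0 = unused.
setId = zeros(1, numImages);
setId(trainIdx) = 1;
setId(valIdx)   = 2;

fprintf('train: %d, val: %d, unused: %d\n', ...
        sum(setId == 1), sum(setId == 2), sum(setId == 0));
```

The validation indices stay fixed even when the training set shrinks, so validation errors remain comparable across runs.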
We start with a simple one layer fully-connected network to demonstrate the basic principles. First, in contrast to previous algorithms, neural networks extend to the multi-class case quite naturally. This is an extension of the logistic regression.
Let's denote the images as $\mathbf{x}\in\mathbb{R}^{28\times 28}$ and the labels as $y\in \{0,1,\dots,9\}$. In the binary logistic regression the probability that an image $\mathbf{x}$ has label $y=1$ was modelled by $$ P(y=1|\mathbf{x},w) = \frac{1}{1+\mathrm{exp}(-\mathbf{w}^\top\mathrm{vec}(\mathbf{x}))} $$ The softmax regression generalises this expression to $K$ classes. The probability that an image $\mathbf{x}$ has a label $y=k$ is modelled as $$ P(y=k|\mathbf{x},w) = \frac{\mathrm{exp}(\mathbf{w}_k^\top\mathrm{vec}(\mathbf{x}))}{\sum_{j=1}^{K}\mathrm{exp}(\mathbf{w}_j^\top\mathrm{vec}(\mathbf{x}))} $$ where $\mathbf{w}_k$ is the vector of weights of the $k$-th output neuron.
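The softmax expression can be evaluated numerically in a few lines. This is a minimal sketch on random stand-in data, not the template's code. The variable names are hypothetical, and mapping class index k to digit k-1 is an assumption about how the labels are encoded.

```matlab
% Sketch: softmax class probabilities for one image (random stand-in data).
K = 10;                                  % number of classes (digits 0-9)
x = rand(28, 28, 'single');              % stand-in for one grayscale image
W = 1e-2 * randn(28*28, K, 'single');    % column k holds the weights w_k

scores = W' * x(:);              % w_k' * vec(x) for every class, K x 1
scores = scores - max(scores);   % shift for numerical stability, result unchanged
p = exp(scores) / sum(exp(scores));

[~, k] = max(p);                 % most probable class (1-based index)
fprintf('sum(p) = %.4f, predicted digit = %d\n', sum(p), k - 1);
```

Subtracting the maximum score before exponentiating does not change the probabilities, because the common factor cancels in the fraction, but it prevents overflow for large scores.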
As shown in the lecture slides, the go-to learning algorithm for neural networks is back-propagation with gradient descent. If you study the formulas carefully, you will see that they contain a sum over all training examples. As the training set grows (which we want for difficult problems), computing this sum at every step becomes prohibitively expensive. Instead, one uses stochastic gradient descent (SGD).
The basic idea of the stochastic gradient descent is simple: Take a smaller portion of the data (called a batch) and use it to estimate the gradient. This way only an estimate of the gradient is obtained. Using it to update the network weights is thus not optimal, but generally guides the optimisation in the right direction.
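The idea can be sketched in plain Matlab, independently of MatConvNet. The example below trains a softmax regression with mini-batch SGD on small synthetic data. All sizes, the learning rate lr and the variable names are hypothetical choices for illustration, not values from the template.

```matlab
% Sketch: mini-batch SGD for softmax regression on synthetic data.
N = 600; D = 20; K = 3;                 % samples, features, classes
X = randn(D, N);                        % one column per sample
Wtrue = randn(D, K);
[~, y] = max(Wtrue' * X, [], 1);        % labels 1..K from a hidden linear model
Y = full(sparse(y, 1:N, 1, K, N));      % one-hot labels, K x N

W = zeros(D, K);                        % weights to learn
lr = 0.1; batchSize = 50; numEpochs = 20;
lossHistory = zeros(1, numEpochs);

for epoch = 1:numEpochs
    order = randperm(N);                % reshuffle the data every epoch
    for start = 1:batchSize:N
        idx = order(start:min(start + batchSize - 1, N));
        S = W' * X(:, idx);                              % scores, K x B
        S = bsxfun(@minus, S, max(S, [], 1));            % numerical stability
        P = bsxfun(@rdivide, exp(S), sum(exp(S), 1));    % softmax, K x B
        % Gradient of the mean cross-entropy, estimated on this batch only.
        G = X(:, idx) * (P - Y(:, idx))' / numel(idx);
        W = W - lr * G;
    end
    % Loss on all data, computed for monitoring only.
    S = W' * X;
    S = bsxfun(@minus, S, max(S, [], 1));
    logP = bsxfun(@minus, S, log(sum(exp(S), 1)));
    lossHistory(epoch) = -mean(logP(sub2ind([K N], y, 1:N)));
end
fprintf('loss after first epoch: %.3f, after last: %.3f\n', ...
        lossHistory(1), lossHistory(end));
```

Each weight update touches only batchSize samples, so its cost does not depend on the size of the training set.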
Task:
The template contains code for training a simple one-layer network with a softmax regression on the output, trained using stochastic gradient descent. MatConvNet stores the layers of a neural network in a structure array. For example, a convolutional layer, which computes the convolution of the input image (input_rows x input_cols x input_channels) with N convolutional kernels (N=output_channels), is initialised as follows:
>> net.layers{1} = struct('type', 'conv', ...
       'weights', {{1e-2*randn(input_rows,input_cols,input_channels,output_channels,'single'), ...
                    randn(1,output_channels,'single')}}, ...
       'stride', 1, ...
       'pad', 0) ;
In order to train a CNN, a loss function must be defined. MatConvNet implements the loss function as a special layer (e.g. see the softmax layer).
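For a training image $\mathbf{x}$ with true label $c$, the loss usually paired with a softmax output is the negative log-probability of the true class. Written in the notation used above, this standard form reads as follows. That the MatConvNet loss layer computes exactly this quantity, summed over the batch, is an assumption to check against its documentation.

$$ \ell(\mathbf{x}, c) = -\log P(y=c|\mathbf{x},w) = -\mathbf{w}_c^\top\mathrm{vec}(\mathbf{x}) + \log\sum_{j=1}^{K}\mathrm{exp}(\mathbf{w}_j^\top\mathrm{vec}(\mathbf{x})) $$

The loss is zero only when the network assigns probability 1 to the true class, and it grows without bound as that probability approaches 0.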
Tasks:
>> [net, info] = cnn_train(net, imdb, @getSimpleNNBatch, 'batchSize', 1000, 'numEpochs', 100, 'expDir', 'expDir');
One of the main disadvantages of using fully-connected layers on images is that they do not take into account the spatial structure of the image. Imagine that you randomly permute the spatial arrangement of image pixels (using the same permutation in both training and test data) and re-train the fully-connected network. The permuted images become practically unrecognisable for humans, since humans rely on prior assumptions about the spatial arrangement. Nevertheless, the expected test error of the re-trained network on this permuted dataset will be the same, since the network makes no such assumptions and learns the spatial arrangement from scratch from the permuted training data. When we learn on images, the architecture of the network should reflect their spatial arrangement.
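The argument can be checked numerically: for any fully-connected layer on the original images there is a layer with re-ordered weights that gives identical outputs on the shuffled images. This is a minimal sketch with random stand-in data, and all variable names are hypothetical.

```matlab
% Sketch: a fully-connected layer gives the same output on shuffled pixels
% once its weights are shuffled in the same way.
x = rand(28, 28);                 % stand-in image
W = randn(10, 28*28);             % stand-in weights, one row per output neuron
b = randn(10, 1);                 % stand-in biases

perm  = randperm(28*28);          % one fixed pixel permutation
xPerm = x(perm);                  % shuffled pixels
WPerm = W(:, perm);               % weight columns re-ordered consistently

out     = W * x(:) + b;
outPerm = WPerm * xPerm(:) + b;

fprintf('max difference: %g\n', max(abs(out - outPerm)));
```

The difference is zero up to floating-point rounding, which is why shuffling the pixels costs a fully-connected network nothing, and also why it gains nothing from the original layout.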
We impose the spatial arrangement by introducing convolutional layers. The convolution works by shifting a local template (often called a convolution kernel or a local receptive field) over the image and computing its response at every position in the image. For example, when the input image is 28×28 and we compute the convolution with a 5×5 kernel, then the resulting response image will be 24×24 (unless we pad the image with zeros). When learned, these templates often correspond to edge or corner detectors.
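The size arithmetic can be verified with the built-in conv2 function, which is used here only to illustrate the output size and is not part of MatConvNet. The helper outSize is a hypothetical name for the usual formula with zero padding and stride.

```matlab
% Sketch: output size of a convolution without padding ('valid' region).
img    = rand(28, 28);            % stand-in input image
kernel = randn(5, 5);             % stand-in 5x5 template

response = conv2(img, kernel, 'valid');
disp(size(response));             % 24 24, i.e. 28 - 5 + 1 per dimension

% General formula per dimension, with zero padding `pad` and step `stride`.
outSize = @(in, k, pad, stride) floor((in + 2*pad - k) / stride) + 1;
fprintf('%d %d\n', outSize(28, 5, 0, 1), outSize(28, 5, 2, 1));   % 24 28
```

With a padding of 2 pixels on each side, the 5×5 kernel produces a response of the same size as the input.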
Another disadvantage of the fully-connected layers is that the number of parameters grows quickly with new layers. This means significantly more parameters need to be learned and thus more data need to be used to avoid overfitting.
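A quick count makes the difference concrete. The layer sizes below (500 hidden units, 20 kernels of size 5×5) are hypothetical values chosen for illustration, not the ones prescribed by the lab.

```matlab
% Sketch: parameter counts (weights + biases) for hypothetical layer sizes.
inputPixels = 28 * 28;
hiddenUnits = 500;                               % hypothetical hidden layer width
fcParams    = inputPixels * hiddenUnits + hiddenUnits;            % 392500

kernelSize = 5; inChannels = 1; numKernels = 20; % hypothetical conv layer
convParams = kernelSize^2 * inChannels * numKernels + numKernels; % 520

fprintf('fully-connected: %d, convolutional: %d\n', fcParams, convParams);
```

The convolutional layer reuses the same small kernels at every image position, so its parameter count does not depend on the image size.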
>> res = vl_simplenn(net, data, [], [], 'Mode', 'test');
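The network output can then be turned into predicted classes and an error rate. This is a minimal sketch with random stand-in values in place of res(end).x. The 1 x 1 x K x N layout of the scores, the 1-based class labels, and the assumption that the last layer outputs per-class scores (rather than a loss value) should all be verified against the template.

```matlab
% Sketch: turn network outputs into predicted classes and an error rate.
% `scores` stands in for res(end).x; its 1 x 1 x K x N layout is an assumption.
K = 10; N = 5;
scores = randn(1, 1, K, N, 'single');
labels = randi(K, 1, N);                 % stand-in ground truth, classes 1..K

[~, pred] = max(scores, [], 3);          % arg-max over the class dimension
pred = reshape(pred, 1, N);              % 1 x N row of predicted classes
valError = mean(pred ~= labels);         % fraction of misclassified images
fprintf('validation error: %.2f\n', valError);
```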