Convolutional Neural Networks

Deep Convolutional Neural Networks (CNNs) returned to prominence in the computer vision community after the breakthrough paper of Krizhevsky et al. [1], which demonstrated large-scale image category recognition with remarkable success. In 2012, their CNN-based algorithm outperformed competing teams from many renowned institutions by a significant margin. This success sparked enormous interest in neural networks in computer vision, to the extent that most successful methods now use neural networks.

The convolutional network is an extremely flexible classifier, capable of fitting very complex recognition and regression problems while generalizing well. The network is a nested composition of non-linear functions. It is usually deep, i.e. it has many layers, and it typically has more parameters than there are samples in the training set. There are mechanisms to prevent overfitting; one of the basic ones is the use of convolutional layers. The network learns shift-invariant filters instead of individual weights for every input pixel. Since the weights are shared, far fewer parameters are required.
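
The effect of weight sharing can be illustrated with a toy example that is not part of the lab code. The sketch below assumes an 8×8 image and a single hand-picked 3×3 filter (both made up for illustration). It shows that a shifted input produces an equally shifted response, and compares the number of shared weights with a dense mapping of the same input.

```matlab
% Toy illustration of weight sharing (image and filter values are made up).
img = zeros(8, 8);
img(3, 3) = 1;                        % a single bright pixel
kernel = [1 2 1; 2 4 2; 1 2 1];       % one 3x3 filter: 9 shared weights

resp = conv2(img, kernel, 'same');    % filter response on the original image

shift = [2 3];                        % move the pixel 2 rows down, 3 columns right
img_shifted  = circshift(img, shift);
resp_shifted = conv2(img_shifted, kernel, 'same');

% The response moves together with the input.
same_pattern = isequal(resp_shifted, circshift(resp, shift));

% Parameter count: shared filter vs. a dense layer mapping 64 pixels to 64 outputs.
n_conv  = numel(kernel);
n_dense = numel(img) * numel(img);
fprintf('shifted response matches: %d, conv weights: %d, dense weights: %d\n', ...
        same_pattern, n_conv, n_dense);
```

The same filter is applied at every position, so the layer does not need a separate weight for every pixel pair.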

Fig. 1: Architecture of a Deep Convolutional Neural Network. Figure adapted from [1].

Usually, the architecture of an image classification CNN is composed of several convolutional layers (which are meant to learn a representation) followed by a few fully connected layers (which implement the non-linear classification stage on top of the invariant representation), see figure 1.

In the following two labs, you will get hands-on experience with CNNs. In the first part, you will work with a pre-trained network, while in the second part you will train your own network from scratch. We recommend using the MatConvNet toolbox from Oxford [2].

1. Working with a pre-trained network

Download and installation

Download the archive, which contains all necessary files: the MatConvNet toolbox, the EdgeBoxes toolbox, pre-trained models, test images and test scripts. Extract the content of the archive into a separate directory.

The main script is test.m. First, the script initializes the MatConvNet toolbox and compiles it on your machine if necessary. Later, you will implement a couple of functions that are called by the test script.

Loading the pre-trained network and understanding the architecture

We will use the model imagenet-vgg-f. It is fast and one of the best-performing networks for large-scale image categorization. It was trained to recognize 1000 classes from ImageNet. The network has the same architecture as the original network proposed in [1], but was trained from scratch by the MatConvNet authors.

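%% Load pre-trained CNN model
model = 'imagenet-vgg-f' ;
net = load(sprintf('models/%s.mat', model)) ;
% display net structure, layer by layer
vl_simplenn_display(net) ;
% display first layer filters as a mosaic
filter_img = vl_imarraysc(net.layers{1}.weights{1}) ;
figure ; imagesc(filter_img) ; axis image off
title('First layer filters')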

The above code loads the model, displays the architecture layer by layer, and visualizes the first layer filters (Fig. 2). Make sure you understand the meaning and functionality of all the layer types: input, conv, relu, mpool, softmx. Refer to the user's manual if you are unsure.

Fig. 2: First layer filters.

Entire image classification

The test image below is first resized to a fixed size of 224×224 pixels, and the average image (computed over the training set) is subtracted.

I = imread('grocery.jpg');
imagesc(I); axis image
title('Input image')
%normalize image
im = imresize(I, net.meta.normalization.imageSize(1:2));
im = single(im) - net.meta.normalization.averageImage;
%run network
res = vl_simplenn(net, im, [], [], 'mode', 'test');

Then the network is executed and all its responses, including the final output, are stored in the structure res. The test script shows the responses of the first layer filters. Classification scores for all 1000 classes are found in the last layer. The code below prints the top 5 scoring classes.

%gather results
r = squeeze(gather(res(end).x));
[rs, id] = sort(r, 'descend');
for i=1:5
    fprintf('%.3f %s \n', rs(i), net.meta.classes.description{id(i)});
end

The output should be:

0.380 grocery store, grocery, food market, market 
0.210 pineapple, ananas 
0.147 banana 
0.102 custard apple 
0.038 strawberry 

Feel free to replace the input image with one of the other attached test images or with your favorite image. You should get an intuition for what the network can recognize and where its limitations are.

Scanning-window detection

The network correctly classified the above image. However, the image apparently contains multiple objects of various categories. The most straightforward approach to detect multiple classes is to use scanning windows.

The idea is that the image is exhaustively scanned with windows, each of which defines a sub-image. All the sub-images are cropped and normalized to the fixed size that is passed into the network. A problem is that there are too many possible sub-images. Luckily, we need not evaluate all of them, since the network is to some extent insensitive to the precise alignment of an object in the image. Therefore, we can scan the image with a small overlap among the scanning windows.

Your task is to write the function scanning_windows.m, which takes the input image, a minimum size of the scanning window, a stride of the scan and a multi-scale factor, and outputs a list of square bounding boxes. See the function template for the exact format.
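
A minimal sketch of how such a list of windows could be generated is shown here. It is not the reference solution: the parameter names, the interpretation of the stride as a fraction of the window side, and the inclusive pixel convention are assumptions, and the function template defines the real interface. Boxes are stored as columns [x1; y1; x2; y2].

```matlab
% Hypothetical parameters; the lab's function template defines the real interface.
img_h = 120; img_w = 160;     % image height and width in pixels
min_size = 40;                % side of the smallest square window
stride   = 0.5;               % step as a fraction of the window side (assumption)
scale    = 2;                 % multiplicative factor between consecutive scales

bboxes = zeros(4, 0);         % columns are [x1; y1; x2; y2]
win = min_size;
while win <= min(img_h, img_w)
    step = max(1, round(stride * win));
    % top-left corners of all windows of this size that fit into the image
    [x1, y1] = meshgrid(1:step:(img_w - win + 1), 1:step:(img_h - win + 1));
    x1 = x1(:)'; y1 = y1(:)';
    bboxes = [bboxes, [x1; y1; x1 + win - 1; y1 + win - 1]];
    win = round(win * scale); % move on to the next, larger scale
end
fprintf('%d windows generated\n', size(bboxes, 2));
```

A smaller stride or a scale factor closer to 1 gives denser coverage at the cost of more network evaluations.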

The set of scanning-window bounding boxes will be used by the test script to prepare a batch of images that is then fed into the network. The top-scoring class of each bounding box is collected, and scores above a threshold are displayed by show_detections.m. Note that you can click on an object to highlight its bounding box and textual description, which might be useful in case of multiple overlapping detections.

Having scores of all 1000 classes in all bounding boxes, the test script shows a response map of a particular class (over all scanned locations and all scales); e.g. a pineapple in the figure below.

Detection with EdgeBoxes

A drawback of the exhaustive scanning-window approach is its high computational cost, since even homogeneous (textureless) regions of the image are evaluated. One way to avoid the expensive search is to use Edge Boxes [3]. EdgeBoxes deliver a list of promising bounding boxes where an object is likely to be present. The algorithm should work independently of the object class. It is based on the simple idea that an object bounding box contains many edges inside it, but has very few edges crossing its boundary.

The following code runs EdgeBoxes (using the authors' implementation) and transforms the output into the same format as the scanning-window bounding boxes.

addpath dep/edges-master
model=load('dep/edges-master/models/forest/modelBsds'); model=model.model;
model.opts.multiscale=0; model.opts.sharpen=2; model.opts.nThreads=4;
% set up opts for edgeBoxes (see edgeBoxes.m)
opts = edgeBoxes;
opts.alpha = .65;     % step size of sliding window search
opts.beta  = .75;     % nms threshold for object proposals
opts.minScore = .01;  % min score of boxes to detect
opts.maxBoxes = 1e4;  % max number of boxes to detect
% detect Edge Box bounding box proposals (see edgeBoxes.m)
tic, fprintf('Generating EdgeBoxes...')
bbs=edgeBoxes(I,model,opts); toc
bboxes = double([bbs(:,1), bbs(:,2), bbs(:,1)+bbs(:,3), bbs(:,2)+bbs(:,4)]');

After the EdgeBoxes proposals are extracted, normalized and fed into the network, the result should look as follows.

Many bounding boxes overlap each other, which makes the result a bit chaotic. Your task is to implement a simple algorithm that selects the highest-scoring detections while suppressing lower-scoring ones that overlap them. Implement the function stable_detections.m, which takes a list of bounding boxes (delivered by EdgeBoxes), a corresponding list of scores (given by the network output), and an overlap threshold on the IoU (intersection over union ratio) as parameters, and outputs the indices of the bounding boxes that are finally selected. Refer to the function template for the exact specification.

The stable detection algorithm proceeds as follows:

  1. Sort bounding box scores in the descending order (creating a queue)
  2. While the queue is not empty
    1. Take the top-scoring bounding box from the queue, add it to the solution subset
    2. Find overlapping competitors (with lower score) based on IoU overlap
    3. Remove the overlapping competitors from the queue.
  3. End
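
The steps above can be sketched on toy data as follows. This is only an illustration under assumptions: the four boxes, their scores and the threshold are made up, boxes are stored as columns [x1; y1; x2; y2], and the variable names are hypothetical. The function template specifies the actual interface.

```matlab
% Toy data: columns of bboxes are [x1; y1; x2; y2] (made-up values).
bboxes = [10 12 100  11;
          10 12 100  50;
          60 62 150  61;
          60 62 150 100];
scores  = [0.9 0.8 0.7 0.3];
iou_thr = 0.5;

area = (bboxes(3,:) - bboxes(1,:)) .* (bboxes(4,:) - bboxes(2,:));
[~, queue] = sort(scores, 'descend');   % step 1: queue ordered by score
keep = [];
while ~isempty(queue)                   % step 2
    best = queue(1);                    % step 2.1: top-scoring box joins the solution
    keep(end+1) = best;
    rest = queue(2:end);
    % step 2.2: IoU of the selected box with every remaining box
    iw = max(0, min(bboxes(3,best), bboxes(3,rest)) - max(bboxes(1,best), bboxes(1,rest)));
    ih = max(0, min(bboxes(4,best), bboxes(4,rest)) - max(bboxes(2,best), bboxes(2,rest)));
    inter = iw .* ih;
    iou = inter ./ (area(best) + area(rest) - inter);
    queue = rest(iou <= iou_thr);       % step 2.3: drop overlapping competitors
end
disp(keep);
```

In this toy setup the second box overlaps the first almost entirely and is suppressed, while the other two boxes survive because their overlap with the selected boxes stays below the threshold.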

The final result should look similar to the figure below:

What should you upload?

You are supposed to upload the functions scanning_windows.m and stable_detections.m together with all non-standard functions you have created and used.

To test your code, run test_publish.m, which calls the main test script test.m and generates an HTML page. Compare your results with ours.

2. Training your own network

In this lab, we will experiment with training a convolutional network for hand-written digit recognition. Neural networks with convolutional layers have been the state of the art in this task for a long time [4].

Download and installation

Download the archive, which contains all necessary files: the MNIST dataset of labeled hand-written digits and the test.m script. Extract the content into a separate directory at the same level as the previous lab. We will again use the MatConvNet toolbox, which was installed in the previous lab.

The MNIST dataset

The MNIST dataset is loaded into the structure imdb. The dataset contains 70k labeled images of size 28×28 pixels. The training/test split is already stored in imdb.images.set, where 1 marks training samples and 2 marks test (validation) samples.

A sample from the dataset is shown here:

Training the baseline network

The structure of the network needs to be set up first. This code initializes the network structure and prints it layer by layer.

net = cnn_mnist_init(); %set up the network structure

The network is relatively shallow and much simpler than the ImageNet category network, but it has proven to perform excellently.

     layer|    0|   1|    2|    3|   4|    5|    6|   7|    8|   9|  10|     11|
      name|  n/a|    |     |     |    |     |     |    |     |    |    |       |
   support|  n/a|   5|    1|    2|   5|    1|    2|   4|    1|   1|   1|      1|
  filt dim|  n/a|   1|  n/a|  n/a|  20|  n/a|  n/a|  50|  n/a| n/a| 500|    n/a|
 num filts|  n/a|  20|  n/a|  n/a|  50|  n/a|  n/a| 500|  n/a| n/a|  10|    n/a|
    stride|  n/a|   1|    1|    2|   1|    1|    2|   1|    1|   1|   1|      1|
       pad|  n/a|   0|    0|    0|   0|    0|    0|   0|    0|   0|   0|      0|
   rf size|  n/a|   5|    5|    6|  14|   14|   16|  28|   28|  28|  28|     28|
 rf offset|  n/a|   3|    3|  3.5| 7.5|  7.5|  8.5|14.5| 14.5|14.5|14.5|   14.5|
 rf stride|  n/a|   1|    1|    2|   2|    2|    4|   4|    4|   4|   4|      4|
 data size|   27|  23|   23|   11|   7|    7|    3|   0|    0|   0|   0|      0|
data depth|    1|  20|   20|   20|  50|   50|   50| 500|  500| 500|  10|      1|
  data num|    1|   1|    1|    1|   1|    1|    1|   1|    1|   1|   1|      1|
  data mem|  3KB|41KB| 41KB|  9KB|10KB| 10KB|  2KB|  0B|   0B|  0B|  0B|     0B|
 param mem|  n/a| 2KB| 320B|   0B|98KB| 800B|   0B| 2MB|  8KB|  0B|20KB|     0B|
parameter memory|2MB (4.3e+05 parameters)|
     data memory|116KB (for batch size 1)|
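
As a rough cross-check of the table, the parameter counts of the four layers that have a filter dimension (layers 1, 4, 7 and 10) can be recomputed from the support, filt dim and num filts rows. The sketch assumes one bias per filter and 4 bytes per value (single precision). The small amounts of parameter memory listed for layers 2, 5 and 8 are not counted here.

```matlab
% Layers with filters, read from the table: [support, filt dim, num filts]
conv_layers = [5   1  20;    % layer 1
               5  20  50;    % layer 4
               4  50 500;    % layer 7
               1 500  10];   % layer 10

% weights (support^2 * filt dim * num filts) plus one bias per filter (assumption)
n_params = conv_layers(:,1).^2 .* conv_layers(:,2) .* conv_layers(:,3) + conv_layers(:,3);
total    = sum(n_params);
mem_kb   = n_params * 4 / 1024;   % single precision, 4 bytes per value (assumption)

fprintf('layer parameters: %d %d %d %d\n', n_params);
fprintf('memory in KB:     %.0f %.0f %.0f %.0f\n', mem_kb);
fprintf('total:            %d\n', total);
```

The total is close to the 4.3e+05 parameters reported in the last rows of the table, and nearly all of it sits in layer 7.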

In the following, we will keep the network structure fixed. The weights were initialized randomly. The training process will now iteratively minimize the soft-max loss (a differentiable surrogate for the multi-class classification error) by stochastic gradient descent (SGD), with the gradients computed by back-propagation.
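
To make the loss and the update rule concrete, here is a toy sketch with made-up numbers for a single sample and three classes. It is a simplification, not what the toolbox does internally: the gradient step is applied directly to the scores, whereas in real training the gradient is propagated back to the network weights.

```matlab
% Toy example with made-up numbers (3 classes instead of 10).
scores = [2.0; 1.0; 0.1];     % network outputs for one sample
label  = 1;                   % index of the true class
lr     = 0.001;               % learning rate, same role as trainOpts.learningRate

p = exp(scores - max(scores));
p = p / sum(p);               % softmax: class posteriors summing to 1
loss = -log(p(label));        % soft-max (log) loss of this sample

onehot = zeros(size(scores));
onehot(label) = 1;
grad = p - onehot;            % gradient of the loss w.r.t. the scores

% One gradient step, applied to the scores directly for illustration only.
scores_new = scores - lr * grad;
fprintf('loss before the step: %.4f\n', loss);
```

The step raises the score of the true class and lowers the others, which reduces the loss slightly. SGD repeats such steps over mini-batches of training samples.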

trainOpts = [];
trainOpts.batchSize = 100 ; %number of images in the SGD step
trainOpts.numEpochs = 15 ;  %number of iterations over all data samples
trainOpts.continue = true ; %resume if true
trainOpts.learningRate = 0.001 ; %scalar that scales the gradient
trainOpts.expDir = 'mnist/baseline' ; %working directory
% Call training function in MatConvNet
[net,info] = cnn_train(net, imdb, @getBatch, trainOpts) ;

The function cnn_train takes the network net (with initial weights), the image dataset imdb, the function getBatch that extracts a batch of images with their labels, and the training options trainOpts.

During the training, several statistics are measured after every batch. The toolbox plots the objective and top1-error and top5-error for both training and validation data after each epoch. The training takes a couple of minutes, and finally the training curves will look similar to the following:

Our network achieved 0.014 validation error. Now, let us try to use the network to read a hand-written phone number.

A scanning window that exhaustively evaluates all possible positions of the digits in the image is implemented. The search is horizontal only. As in the previous lab, we prepare a stack of images that is fed into the trained network.

In the above visualization, we see the response map of all the characters, the best score labels and the best score in each position of the sliding window. The results are not as good as you might have expected. What is wrong?

Obviously, the network cannot recognize blank space and consistently outputs digit '1' instead, as it is the sparsest digit. Moreover, while the network was trained on isolated digits, scanning windows often contain adjacent characters, and the network is thus confused by the context. For example, both digits '5' are usually recognized as digit '8'.

Training the context-robust network

To quantify the network's performance on digits surrounded by other digits, we prepared a small set of 1000 digits with simulated context. A sample is shown below.

Although the error rate of the baseline network for the isolated recognition was only 0.014, the error rate climbs to 0.079 on the context dataset.

A remedy is to train the network to recognize the digits in context. We will train a new network that has one more class for blank space and that is given labeled examples with simulated context during training, so that it becomes insensitive to it. This approach is called data augmentation. The following code will do the training:

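% Add "space" character to capture blank space (5000 all-zero images)
imdb.images.data(:,:,:,end+1:end+5000) = 0;
imdb.images.labels(end+1:end+5000) = 11;
imdb.images.set(end+1:end+3000) = 1;   % 3000 blank samples for training
imdb.images.set(end+1:end+2000) = 2;   % 2000 blank samples for validation
imdb.meta.classes{end+1} = ' ';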
net = cnn_mnist_init('num_classes', 11);  %one more class added
trainOpts = [];
trainOpts.batchSize = 100 ;
trainOpts.numEpochs = 15 ;
trainOpts.continue = true ;
trainOpts.learningRate = 0.001 ;
trainOpts.expDir = 'mnist/context' ;
% Call training function in MatConvNet
[net,info] = cnn_train(net, imdb, @getBatchWithContext, trainOpts) ;

Your task is to implement the function getBatchWithContext, which replaces the original getBatch of the baseline network and delivers images with simulated context. We recommend composing the simulated images from random adjacent digits (from the same batch) with randomly tight margins between the characters. Your images should look similar to those in our context test set.
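
One possible way to simulate the context is sketched below. It is not the reference solution and rests on several assumptions: the batch is a 28×28×1×N array, the background has value 0 with brighter strokes, the range of neighbour distances (min_shift, max_shift) is a made-up choice, and random values stand in for real digits. Neighbours are overlaid with a pixel-wise maximum, and the labels of the central digits stay unchanged.

```matlab
% Stand-in batch: random values instead of real MNIST digits (assumption).
im = single(rand(28, 28, 1, 4));
n  = size(im, 4);
w  = size(im, 2);
min_shift = 16; max_shift = 24;      % hypothetical distance between digit centres, in pixels

out = zeros(size(im), 'single');
for i = 1:n
    canvas = zeros(size(im, 1), 3 * w, 'single');
    canvas(:, w+1:2*w) = im(:, :, 1, i);             % the labelled digit in the centre
    for side = [-1 1]                                % left and right neighbour
        nb = im(:, :, 1, randi(n));                  % random neighbour from the same batch
        d  = randi([min_shift max_shift]);           % random margin: distance from the centre
        cols = (w + 1 + side * d) : (2 * w + side * d);
        canvas(:, cols) = max(canvas(:, cols), nb);  % overlay without erasing existing strokes
    end
    out(:, :, 1, i) = canvas(:, w+1:2*w);            % crop back to the original window
end
```

Because the distance between digit centres is smaller than the window width, parts of the neighbouring digits enter the cropped window, which is the effect seen in the scanning-window setting.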

After the training, the result on the context set should improve. You should achieve an error rate of around 0.018 on the context set, while keeping an error rate of around 0.013 for isolated recognition.

Results should also improve for scanning windows on the phone number image. Notice that, besides the spaces being found correctly, there is much less confusion in the response map, and the digits are always recognized correctly when the window is well aligned with the digit.

Hand-written Phone Number Recognition Contest

The above image shows a row of 9 digits that could be a hand-written phone number. Previous experiments show promising results in the network's ability to recognize isolated digits or digits perturbed by surrounding context. However, a practical task would be to read the number from the input image, i.e. to design an algorithm that takes an input image and outputs a string of 9 recognized digits. This will be your task.

Download the dataset phone_numbers.mat. The dataset contains two variables: images (28×250×1000), which contains 1000 images of phone numbers similar to the one above, and labels, a 1000×9 char array of the corresponding image labels. More precisely, each row of this array contains the 9 digits of one phone number.

The true labels are given for the first 50 images only; your task is to complete the labels for the remaining 950 images (denoted by the '?' symbol). It is completely up to you how you approach the problem. To motivate you to come up with a high-quality solution, you will compete with your colleagues in a contest. You will be awarded bonus points depending on your success.

We know the ground-truth labels for the 950 unlabeled images. We will measure your average number of digit misclassifications per phone number over the entire set of phone number images. That is, the best possible error is 0 if no mistake is made, while the theoretical maximum is 9 if all the digits are always recognized incorrectly.
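
The error measure can be sketched on made-up labels as follows. The variable names and the three example phone numbers are hypothetical, and the actual evaluation script may differ in details.

```matlab
% Made-up labels for three phone numbers (char arrays, one row per image).
truth = ['123456789'; '987654321'; '555000111'];
pred  = ['123456789'; '987654320'; '565000711'];

errors_per_number = sum(pred ~= truth, 2);   % number of wrong digits in each row
mean_error = mean(errors_per_number);        % 0 is perfect, 9 is the worst case
fprintf('mean number of wrong digits per phone number: %.2f\n', mean_error);
```

The same computation restricted to the 50 labeled images gives a quick local estimate of the error before uploading.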

You are supposed to upload into task 11_contest a zip-archive that contains the following files:

  1. results.mat with your results in variable 'labels' (char array of 1000×9).
  2. approach.txt one paragraph of plain text briefly describing your approach. Do not forget to mention the computational time.
  3. contest.m (+ all non-standard functions) that generates file results.mat. Note that the code will not be executed by the upload system before the deadline to save computational time. However, the code must generate exactly the same output results.mat as you uploaded. Your script must not run any training of the network. Upload a pre-trained network (as a mat-file) that is loaded by your script instead.

It is important that you strictly preserve the upload format, since the error is computed automatically immediately after you upload the results. The system will report your error. You can also see the current leaderboard; note that you must be logged in to BRUTE in order to see the scoring table. You can re-upload up to 100 times before the deadline, but not after the deadline, when the contest is finished. Only your last results are considered for the contest.

The evaluation. You will get points based on the final ranking in the leaderboard. The winner, i.e. the student with rank 1 achieving the minimum error, will get 7 points. Each subsequent rank will get one point less (rank 2 - 6p, rank 3 - 5p, …, rank 7 - 1p etc.). Negative points are not given. All students who upload results with a mean error at least 20% better than a random guess will get 1 point; otherwise no point is given.

The rules. Please read the following rules carefully, since a violation may result in disqualification.

  • Each participant may upload up to 100 times before the deadline, but not more and not later. The submission system will be closed strictly after reaching 100 submission attempts and will be closed immediately after the deadline.
  • The points are given based on the final ranking of the leaderboard after the deadline.
  • If multiple participants achieve equal mean errors, the better rank goes to the participant with the lower number of upload attempts. If both the mean errors and the numbers of upload attempts are equal, all those participants share the best applicable rank (with the corresponding points).
  • Wrong format submissions are not accepted.
  • No manual processing (hand-labeling) is allowed. Results must be produced fully automatically and the code generating the results must be provided.
  • If the code fails or generates results that are not identical to those of 'results.mat', the participant is disqualified.
  • Third-party code can be used, but end-to-end solutions are not allowed. Consult your teacher if in doubt.

What should you upload?

You are supposed to upload the function getBatchWithContext.m and your trained networks as files mnist/baseline/net-epoch-X.mat and mnist/context/net-epoch-Y.mat, where X and Y are the epochs at which the best performance is reached for the baseline and the context network respectively. Your zip archive must contain only those two mat-files placed in the folders following the directory tree shown in figure 3. Do not forget to include all non-standard functions you have created and used.

Fig. 3: Example of directory tree.

For the phone reading contest, upload everything into '11_contest' task. The zip file will contain 'results.mat', 'approach.txt', 'contest.m' together with all non-standard functions you have created.


To test your code, run test_publish.m, which calls the main test script test.m and generates an HTML page. Compare your results with ours.


  1. A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
  2. A. Vedaldi, K. Lenc. MatConvNet – Convolutional Neural Networks for MATLAB. In ACM Int. Conf. on Multimedia, 2015.
  3. C. L. Zitnick, P. Dollar. Edge Boxes: Locating Object Proposals from Edges. In ECCV, 2014.
  4. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

Jan Čech 2016/04/26 17:07

courses/mpv/labs/5_convolutional_networks/start.txt · Last modified: 2018/02/19 15:05 (external edit)