This page is located in the archive.

How to start work

  1. unpack the code archive to $MDNETDIR
  2. ln -s /opt/cv/tracking/datasets/OTB $MDNETDIR/dataset/OTB
  3. ln -s /opt/cv/tracking/datasets/vot-dataset/vot2015 $MDNETDIR/dataset/VOT/2015
  4. ln -s /opt/cv/tracking/datasets/vot-dataset/vot2016 $MDNETDIR/dataset/VOT/2016
  5. cd $MDNETDIR
  6. run matlab
  7. run setup_mdnet
  8. run compile_matconvnet

MDNet pretraining

You will implement, in a drop-in manner, several key functions of MDNet, the winner of the VOT2015 Challenge. We will use MatConvNet as the deep learning library for this task.

Here are useful links explaining MatConvNet functions:

Manual, Classification tutorial, Regression tutorial

Channel ordering in MatConvNet is Height*Width*Channels*Num.

The MDNet tracker works as follows:


  1. Load the conv1 - conv3 layers of a pretrained VGG-M network.
  2. Randomly initialize new fc4-fc6 layers. Note that despite their names, all new layers are actually convolutional, not fully-connected. Set the new layers' learning rate to 10 * lr. fc6 outputs the score for the image being centered on, and fully containing, the given tracked object.
  3. Finetune conv1-conv3 (and learn fc4-fc6 from scratch) on a subset of tracking sequences.

Running with online finetuning

  1. Train the new fc6 layer on the initial frame.

For each subsequent frame:

  1. Sample object locations (windows) around the previous object position.
  2. Estimate their scores; set the new object position as the mean of the top-5 scored windows.
  3. Store positive and negative samples for the update.

Every 10th frame:

  1. Mine hard negatives: the negative samples with the highest detection scores.
  2. Fine-tune the network.
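The per-frame loop above can be sketched as follows. This is a minimal outline, not the provided code: estimate_target_location, gen_samples and train_mdnet_on_batch are the functions you will write below, and the sample bookkeeping is simplified.

  % Hedged outline of the online loop; frames, nets, opts etc. come from the demo.
  pos_store = {}; neg_store = {};
  for t = 2:numel(frames)
      img = imread(frames{t});
      % sample windows around the previous position and score them
      [targetLoc, target_score] = estimate_target_location(...
          targetLoc, img, net_conv, net_fc, opts, max_shift, scale_std);
      % store positive/negative samples for the periodic update
      pos_store{end+1} = gen_samples('gaussian', targetLoc,  50, opts, 0.1, 1);
      neg_store{end+1} = gen_samples('uniform',  targetLoc, 200, opts, 2,   1);
      if mod(t, 10) == 0
          % mine hard negatives (highest-scoring stored negatives) and
          % fine-tune fc4-fc6 with train_mdnet_on_batch - omitted here
      end
  end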

You will implement some of the functions used in this process.

Initialization of new layers and learning-rate manipulation

Download the code: All-in-one-MDNet.tar.gz

1. Write a function layers = add_fc_layers_to_net(layers), which adds new layers on top of the existing net.

The layers to be added are: fc4, fc5, fc6, loss.


  • learning rate: 10 for weights, 20 for biases
  • type: conv (softmaxloss for the loss layer)
  • weight decay: 1 for weights, 0 for biases
  • stride: 1
  • pad: 0

Example from the tutorial:

  net.layers{1} = struct(...
      'name', 'conv1', ...
      'type', 'conv', ...
      'weights', {{randn(10,10,3,2,'single'), randn(2,1,'single')}}, ...
      'pad', 0, ...
      'stride', 1) ;
  net.layers{2} = struct(...
      'name', 'relu1', ...
      'type', 'relu') ;

Initialize the weights with Gaussian noise, std = 0.01.
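A possible sketch of add_fc_layers_to_net, assuming the simplenn layer convention from the example above plus per-layer 'learningRate' and 'weightDecay' multiplier fields; the 3x3x512 input size of fc4 is an assumption based on MDNet's conv3 output, so verify it against your network:

  function layers = add_fc_layers_to_net(layers)
  % Sketch: append fc4-fc6 (implemented as conv layers) and a loss layer.
  init = @(h,w,in,out) {{0.01*randn(h,w,in,out,'single'), ...
                         zeros(out,1,'single')}};
  lr = [10 20]; wd = [1 0];   % 10/20 lr for weights/biases; no decay on biases
  layers{end+1} = struct('name','fc4', 'type','conv', ...
      'weights', init(3,3,512,512), 'pad',0, 'stride',1, ...
      'learningRate',lr, 'weightDecay',wd);
  layers{end+1} = struct('name','relu4', 'type','relu');
  layers{end+1} = struct('name','fc5', 'type','conv', ...
      'weights', init(1,1,512,512), 'pad',0, 'stride',1, ...
      'learningRate',lr, 'weightDecay',wd);
  layers{end+1} = struct('name','relu5', 'type','relu');
  layers{end+1} = struct('name','fc6', 'type','conv', ...
      'weights', init(1,1,512,2), 'pad',0, 'stride',1, ...
      'learningRate',lr, 'weightDecay',wd);
  layers{end+1} = struct('name','loss', 'type','softmaxloss');
  end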

2. Write a function net = mdnet_add_domain_specific_head_with_high_lr(opts,K), which adds a new, 2K-way fc6 layer on top of the existing net, followed by a softmaxloss_k loss layer.

3. Write a function net = mdnet_finish_train(net), which sets the learning rate to (weights = 1, biases = 2) for all layers, and replaces the fc6 layer with a new 2-way classifier with 10x learning rates.
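A sketch of mdnet_finish_train under the same assumptions (layers identified by a 'name' field, multipliers stored in a 'learningRate' field; the number of fc6 input channels is read from the existing weights):

  function net = mdnet_finish_train(net)
  % Sketch: reset learning-rate multipliers and swap in a fresh 2-way fc6.
  for i = 1:numel(net.layers)
      if isfield(net.layers{i}, 'learningRate')
          net.layers{i}.learningRate = [1 2];
      end
      if isfield(net.layers{i}, 'name') && strcmp(net.layers{i}.name, 'fc6')
          in = size(net.layers{i}.weights{1}, 3);     % input channels
          net.layers{i}.weights = {0.01*randn(1,1,in,2,'single'), ...
                                   zeros(2,1,'single')};
          net.layers{i}.learningRate = [10 20];       % x10 lr for the new head
      end
  end
  end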

Implementing gradient step update

Write a function [net,res] = train_mdnet_on_batch(net,res,batch,labels,seq_id,opts), which does the following:

  1. A forward-backward pass by calling mdnet_simplenn; the result of the operation is stored in res.
  2. Updates the momentum value for weights and biases using the formula: ∇w = ∇w * mom - lr * (w_decay * w + gradient / batch_size), where lr = opts.learningRate * filter_lr and w_decay = opts.weightDecay * filter_wd.
  3. Updates the filters and biases in the network using the formula: w = w + ∇w.

Layer parameters are stored in the layers structure. The gradient for the i-th layer is stored in res(i).dzdw{1} for weights and res(i).dzdw{2} for biases.

To send labels to the network, use net.layers{end}.class = labels. dzdy is not needed.
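The update step might look like this. It is a sketch: the mdnet_simplenn call signature and the per-layer momentum field are assumptions to verify against the provided code.

  function [net, res] = train_mdnet_on_batch(net, res, batch, labels, seq_id, opts)
  % Sketch of one SGD-with-momentum step. seq_id selects the domain-specific
  % branch of softmaxloss_k during pretraining (not shown here).
  net.layers{end}.class = labels;             % send labels to the loss layer
  res = mdnet_simplenn(net, batch, [], res);  % forward-backward pass (signature assumed)
  batch_size = size(batch, 4);
  for i = 1:numel(net.layers)
      if ~isfield(net.layers{i}, 'weights'), continue; end
      for j = 1:2                             % j = 1: filters, j = 2: biases
          lr = opts.learningRate * net.layers{i}.learningRate(j);
          wd = opts.weightDecay  * net.layers{i}.weightDecay(j);
          w  = net.layers{i}.weights{j};
          grad = res(i).dzdw{j} / batch_size;
          % momentum kept in a (hypothetical) per-layer field
          net.layers{i}.momentum{j} = opts.momentum * net.layers{i}.momentum{j} ...
              - lr * (wd * w + grad);
          net.layers{i}.weights{j} = w + net.layers{i}.momentum{j};
      end
  end
  end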

Positive and negative examples sampling

Write a function [ bb_samples ] = gen_samples(type, bb, n, opts, max_shift, scale_std). The function should generate n samples around the center of the bounding box bb. The bounding box format is [left top width height].

''type'' — the sampling method, a string, one of:
  - 'gaussian' -- generate samples from a Gaussian distribution centered at bb. Used for positive samples and target candidates.
  - 'uniform' -- generate samples from a uniform distribution around bb. Used for negative samples.
  - 'uniform_aspect' -- generate samples from a uniform distribution around bb with varying aspect ratios. Used for training samples for bbox regression.
  - 'whole' -- generate samples from the whole image. Used for negative samples at the initial frame.
''max_shift'' — the maximum shift, in pixels, of a generated window sample from the object center.
''scale_std'' — the std of the window scale. The final scale should be proportional to ''opts.scale_factor'' ^ ''scale_std''.

Functions randsample and rand might be helpful for you.
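A sketch of the 'gaussian' branch only; the clipping constants and the exact use of randn here are illustrative design choices, not the reference implementation:

  function bb_samples = gen_samples(type, bb, n, opts, max_shift, scale_std)
  % Sketch: only the 'gaussian' branch; bb = [left top width height].
  c = bb(1:2) + bb(3:4) / 2;                % object center
  switch type
      case 'gaussian'
          % shifts are Gaussian, clipped to +-max_shift pixels
          dxy = max(-max_shift, min(max_shift, 0.5 * max_shift * randn(n, 2)));
          % scale ~ opts.scale_factor ^ (scale_std * clipped Gaussian)
          sc = opts.scale_factor .^ (scale_std * max(-1, min(1, 0.5 * randn(n, 1))));
          ws = bb(3) * sc;  hs = bb(4) * sc;
          bb_samples = [c(1) + dxy(:,1) - ws/2, c(2) + dxy(:,2) - hs/2, ws, hs];
      otherwise
          error('sampling type %s is not sketched here', type);
  end
  end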

Get target location estimate

Write a function [targetLoc,target_score] = estimate_target_location(targetLoc, img, net_conv, net_fc, opts, max_shift, scale_std). The bounding box format is [left top width height].

  1. targetLoc – the predicted bounding box. As an input parameter, the target location in the previous frame.
  2. target_score – the classifier score for the predicted location.
  3. net_conv – the first part of the net, containing layers up to conv3.
  4. net_fc – the fc4-fc6 layers.

The function should generate sample candidates, evaluate the classifier scores, and output the location of the best-scoring window. You could also try averaging the top-k predictions.

Hint: use the mdnet_features_convX and mdnet_features_fcX MDNet functions for network inference.
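Putting it together, estimate_target_location might look like the following sketch; the output shape of mdnet_features_fcX, the exact argument lists, and the sample count are assumptions to check against the provided code:

  function [targetLoc, target_score] = estimate_target_location(...
          targetLoc, img, net_conv, net_fc, opts, max_shift, scale_std)
  % Sketch: sample candidates, score them, average the top-5 windows.
  samples = gen_samples('gaussian', targetLoc, 256, opts, max_shift, scale_std);
  feat = mdnet_features_convX(net_conv, img, samples, opts);  % conv3 features
  scores = mdnet_features_fcX(net_fc, feat, opts);            % fc6 scores
  scores = squeeze(scores(1, 1, 2, :));     % positive-class score per window
  [sorted_scores, idx] = sort(scores, 'descend');
  target_score = mean(sorted_scores(1:5));
  targetLoc = mean(samples(idx(1:5), :), 1);
  end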

Run tracking

Run demo_tracking.m on several sequences with the provided mdnet_otb-vot15.mat model. Check that everything works.

Run MD-Net pretraining

Run the function demo_pretraining.m (needs a GPU and several hours). Check that the training error decreases with each iteration. Store the resulting model as mdnet_otb-vot15_new.mat.

Compare the authors' model with yours

Run demo_tracking.m on several sequences with mdnet_otb-vot15_new.mat. Compare against the provided model.

courses/ucuws17/labs/10_cnn_tracking.txt · Last modified: 2017/01/21 19:45 by mishkdmy