Object tracking

Video tracking is a process for position estimation of an object in time using a camera. Tracking algorithm analyzes video frames and estimates a position of the object.

A simple models for tracking are for instance:

A planar object with motion model of 2D image transformation (affine or homography).
A rigid 3D object with a motion model depending on its 3D position and orientation.
Object with a feature points (e.g. Harris points) or regions (e.g. MSERs) in the image, which are tracking separately. This provides robustness to occlusions.
Non-rigid (deformable) object approximated with the surface wireframe. The motion of an object is defined by position of wireframe edges.

Popular tracking algorithms are: correlation, mean-shift, Kanade-Lucas-Tomasi (KLT) tracking, Kalman tracking, particle filtering, (sequential) linear predictors and so on. In this course you will familiarize with tracking using the correlation and the KLT tracker.

Tracking I, tracking by finding correspondences, correlation.

All files needed for this lab can be found in /opt/cv/tracking of your virtual machine.

We already know how to find correspondences between images. The task of tracking is very similar. It is searching for object correspondences (of image frame, region, feature point with a neighborhood) in a given frame with subsequent unknown frames. The difference between correspondence finding in two wide-baseline images is the fact, that now we can expect only small changes of the object (position, deformation, photometric) between successive frames. Nevertheless, the changes of the object appearance in a whole sequence can be significant. Selection of the object for tracking is done manually or by automatic detection (which is outside of scope of this lab).

You can choose to implement one of these two methods:

tracking by correspondence finding using methods from third retrieval task
tracking by correlations in Harris points neighborhood

Since we will be working with video in Matlab, download function processMpvVideo(filename,method,options), where filename is name of the videofile (e.g. example.avi), method is a tracking method and options is a structure with parameters for tracking function. Function creates a video sequence with tracked points plotted and writes the output into folder ./export/ and shows a preview meantime.

However, to avoid codecs problem, video is converted to sequence of images and processMpvVideo will read them and save output sequence of images into folder ./export/. For joining into video, which you have to submit use:

Linux: convert export/*.jpg export/%05d.JPG; ffmpeg -r 25 -i export/%05d.JPG export/vid.mp4

You will implement all methods for tracking into functions xNew = track_method(imgPrev,imgNew,xPrev,options), where imgPrev is a previous frame in the sequence, xPrev are tracked points in the previous frame, imgNew is a new image in the sequence and xNew are tracked points founded in the new frame. Tracked points are represented in a structure,

  x.x    % COLUMN vector of ''x'' coordinates
  x.y    % COLUMN vector of ''y'' coordinates
  x.ID   % COLUMN vector unique identifier of the point (index to the array is not enough because some points will disappear during tracking)
  x.data % specific information for the tracking method (e.g. SIFT description)

As we mentioned before, tracking algorithms tracks the selected object during the entire sequence. Selection of the object for tracking will be done in two phases - processMpvVideo calls function data = processMpvInit(img,options), which selects the object for tracking in the first image img of the sequence. In our case, it is represented by a bounding box.

  data.xRect    %x coordinates of bounding box corners (anticlockwise)
  data.yRect    %y coordinates of bounding box corners (anticlockwise)

The second function, x = track_init(img,options), finds points x for tracking. options is a structure with parameters.

At the end, for each image the following function is called processMpvFrame(data,imgPrev,imgNew,xPrev,xNew,options), which uses founded tracking points, for instance for estimating a homography transformation between frames.

Set structure options and call function processMpvVideo.m in “main” script cv08_method.m, which you will also submit.

To ease your task and better understanding of the parameters, we prepared templates of described functions, which simulates constant shift of randomly selected points.

All-in-one

Your task

Your task is to hide an identity of a selected object in a video sequence. You will select the object in the first frame (for instance your least favorite vegetable) and after that you will blur this object in the whole sequence (in a similar way as you know from TV).

It can be done in two ways:

direct tracking of the selected object
assuming that object is not moving, tracking the whole scene and bluring the right place

If we know or can assume that there is only one scene in the sequence and it is planar (eg. images from airplane), we can use the second method - tracking the scene and estimating a homography, in the same way as in third retrieval task or course A4M33TZ. It will create more robust tracking.

In obligatory part of this task we will track the whole scene and estimate the frame homographies. Optionally you can compare this method with the direct object tracking.

The goal of this task is to familiarize and to fill in the universal framework for tracking and implement a simple tracking algorithm based on your previous knowledge (either correlations or correspondences). Use the successfully tracked points in the scene for homography estimation between images and for trajecting of the rectangle across the sequence.

Familiarize yourself with tracking framework and testing sequence:
- xvid billa_xvid.avi
- jpeg billa_jpg.zip
- MPEG4 billa_mpeg4.avi
- WMV2 billa_wmv2.avi
Choose implementation using correspondences or correlations.
Implement detection of tracking point in the whole image in function track_init_method.m
Implement tracking of the points of interest in function track_method.m
Try your algorithm on the testing video and draw tracked points
From tracked points estimate homographies between images
Transform the bounding box of the selected object
Blur the image inside the bounding box
Join points 6-8 into function processmpvframe.m
Generate the video with the blurred object

Example of bounding box transformation (without blurring) export_billa_xvid.avi

Tracking using correspondences

On the base of your knowledge and use of your code from the second task you should be able to implement a function for finding corresponding points between successive frames of the sequence and estimate the homography between the frames given tentative correspondences. There is only a little portion of the new code. The suitable combination for a first experiment are Harris points and DCT descriptors. If your algorithms from the second task would not give you good results, you can try to add a constraint that the distance between corresponding points in two frames cannot be too large (Due to a slow motion, the scene cannot change much between successive frames).

The method you will choose is up to you. The minimal acceptable solution for this task is your two functions track_init_correspond.m and track_correspond.m, that track and transform the selection during the whole sequence.

Tracking using the correlation

Tracking using the correlation finds correspondences by directly comparing image intensities between the feature point neighborhoods, unlike constructing a compact normalized descriptor (SIFT, DCT, etc.), We will use the correlation coefficient as the similarity statistic.

Correlation coefficeint (also called normalized cross-correlation, NCC) is defined as

$ NCC(T, I) = \frac{\sum_{\mathbf{k}}\sum_{\mathbf{l}}(T(k,l) - \overline{T})(I(x+k,y+l)-\overline{I(x,y)})} {\sqrt{\sum_{\mathbf{k}}\sum_{\mathbf{l}}(T(k,l) - \overline{T})^2} \sqrt{\sum_{\mathbf{k}}\sum_{\mathbf{l}}(I(x+k,y+l) - \overline{I(x,y)})^2}} $

where <latex> T(x,y)</latex> is image template window and <latex> I(x,y)</latex> is target image window.

NCC value is inside the range <-1,+1> and is commonly used as a statistic for similarity between image windows, see matlab function corr2 .

Advantages of NCC: invariance to affine transformation of intensity
Disadvantages of NCC: not invariant to scale changes, rotation and perspective distortion. For some applications the computational complexity is the problem.

Harris corner points are often suitable for tracking. By using function harris(img,<latex>\sigma_d</latex>,<latex>\sigma_i</latex>,thresh) write a function x = track_init(img,options), which will return Harris points x in a chosen region of interest options.ROI. Structure options will contain fields sigma_d, sigma_i and thresh.

As we mentioned above, the method is analogous to the method for correspondence finding - detect Harris points in each frame (track_init will be called also in track_corr!) and find a correspondences for them using the NCC. Set the size of the neighborhood with parameter options.ps and compare the results with a different setting. Assuming only small shifts between successive frames, it is not necessary (and also not desirable) to compute the correlation between all pairs. For each point from the first image, compute only correlations with points in the second image up to some distance. Obviously, too distant points cannot correspond. Set this threshold with parameter options.corr_max_dist. The value depends on a character of the motion in actual sequence. For our task we recommend the value 30px.

To identify promising matches, i.e. correspondences of points with a high correlation coefficient, use for instance the principle of mutual nearest correspondences (two points are paired if the second point is the closest one from points in the second image and simultaneously the first point is the closest one from points in the first image).

If there is no point in a new frame that corresponds to the tracked point from the previous frame, this tracked point is discarded and not tracked anymore.

Join this functionality into xNew = track_corr(imgPrev,imgNew,xPrev,options).

Homography estimation

If potentially corresponding points between frames (tentative correspondences) are known and assuming only a (approximately) planar scene, then the homography between frames can be estimated. We will use our function [Hbest,inl]=ransac_h(u,threshold,confidence) from the second task. Create the vector of points u. The IDs are known and you also know that xPrev contains all points from xNew. Keep the setting in fields options.rnsc_threshold and options.rnsc_confidence.

Discard the points which are homography outliers from further tracking. Advanced algorithms has methods for adding new tracking points, however we will not implement any in this course.

Knowing the homography (matrix H), you can transform (<latex>x_{n} = H_{best} x_{p}</latex>) the corners of bounding box which outlines the selected object from one frame to next frame. Blur the interior region with function gaussfilter.m with high enough sigma. Do not blur the image outside of the selection (you can use your knowledge from this course).

Join your functions into [dataOut xNewOut] = processMpvFrame(data,imgPrev,imgNew,xPrev,xNew,options). This function returns structure dataOut containing

  dataOut.xRect    %transfomed x-coordinates of bounding box corners (anti-clockwise) 
  dataOut.yRect    %transfomed y-coordinates of bounding box corners (anti-clockwise)
  dataOut.H        %estimated homography

and xNewOut containing tracked points in the current frame without outliers. Within this function, implement also a drawing of the current image frame into a figure with a blurred selection and show the transformed bounding box.

What you should upload?

Submit into upload system: the file cv08.m with structure options for your implementation, your function processmpvframe.m and (track_init_correspond.m and track_correspond.m) or (track_init.m and track_corr.m), together with all used non-standard functions you have created. Submit also generated video-file export_billa_xvid.avi with the blurred selection and the generated file homography.mat with all homographies.

The minimal acceptable solution for this task is to implement one of the two simple methods for tracking in a way that it will generate enough points and without adding new points it will find homographies for the entire sequence (acceptable minimum is 20 point in the scene)

Please do not submit the function procesMpvVideo.m. It complicates the automatic evaluation.

Testing

To test your code, you can use a matlab script test.m and a function 'publish'. Compare your results with ours.

As a quality measure of your algorithm, you can use the last part of the test, where you are tracking points on the image which is transformed with known homography and in ideal case you will get the same homography back. Other comparisons are with KLT tracker from Open CV.

Where to go next

Tracking using correspondences has several disadvantages

necessity of feature detection (harris points…) in next image
precision is given with precision of detected points

We will show in the next task, how we can look for the selection (or its transformation) in the next image. The object will be selected only in the first image.

2. Kanade-Lucas-Tomasi Tracking (KLT tracker)

KLT minimizes sum of squared difference of image intensities between windows in subsequent frames. The minimum is found iteratively by Newton-Raphson method.

We are given a patch template $T(\mathbf{x})$ centered at pixel $\mathbf{x} = [x,y]^T$ in image frame at time $t$. In a subsequent frame, at time $t+1$, the target moves to a new position described coordinate transformation $\mathbf{W}(\mathbf{x;p})=[x+p_x;y+p_y]^T$. The task is to estimate displacement $\mathbf{p} = [p_x, p_y]^T$.

<latex> (1)\;\;\sum_{\mathbf{x}}(I(\mathbf{W}(\mathbf{x;p})) - T(\mathbf{x}))^2. </latex>

Consider the best shift is <latex>\ \Delta \mathbf{p} </latex>, from (1) we will get

<latex> (2)\;\;\sum_{\mathbf{x}}(I(\mathbf{W}(\mathbf{x;p}+\Delta \mathbf{p})) - T(\mathbf{x}))^2. </latex>

We minimize this expression with respect to <latex>\ \Delta \mathbf{p} </latex>. Nonlinear expression (2) is linearized by (first order) Taylor expansion

<latex> (3)\;\;\sum_{\mathbf{x}}(I(\mathbf{W}(\mathbf{x;p})) + \nabla I\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\Delta \mathbf{p}- T(\mathbf{x}))^2, </latex>

where <latex> \nabla I = [\frac{\partial I}{\partial x},\frac{\partial I}{\partial y}] </latex> is gradient at <latex>\ \mathbf{W}(\mathbf{x;p}). </latex> Term <latex> \frac{\partial \mathbf{W}}{\partial \mathbf{p}} </latex> is Jacobian matrix of the coordinate transformation

<latex>(4)\;\; \frac{\partial \mathbf{W}}{\partial \mathbf{p}}= \frac{ \partial[x+p_x;y+p_y]^T}{\partial [p_x,p_y]^T}= \left[\begin{array}{cc} 1 & 0
0 & 1 \end{array}\right].</latex>

The minimum of expresion (3) over $\Delta \mathbf{p}$ is

<latex> (5)\;\;\Delta \mathbf{p}=H^{-1}\sum_{\mathbf{x}}[\nabla I\frac{\partial \mathbf{W}}{\partial \mathbf{p}}]^T [T(\mathbf{x})-I(\mathbf{W}(\mathbf{x;p}))] </latex>

where <latex> H </latex> is approximation of the Hessian matrix used in Gauss-Newton gradient method. This is a nonlinear regression in fact (nonlinear least squares). Note that the approximation of the Hassian matrix in this case is equal to the autocorrelation matrix (Harris matrix), i.e. a dot product of the first partial derivatives. This suggests that a good idea is to track the points in near neighborhood around Harris points

<latex> (6)\;\; H = \sum_{\mathbf{x}}[\nabla I\frac{\partial \mathbf{W}}{\partial \mathbf{p}}]^T[\nabla I\frac{\partial \mathbf{W}}{\partial \mathbf{p}}]</latex>

Substituting (4) into (5) and (6) simplifies <latex> \nabla I\frac{\partial \mathbf{W}}{\partial \mathbf{p}}=\nabla I </latex>. The displacement correction $\Delta \mathbf{p}$ is computed in each iteration and the estimated shift is updated by

<latex> (7)\;\; \mathbf{p} \leftarrow \mathbf{p} + \Delta \mathbf{p} </latex>

The iterations are terminated by setting the maximum number of iteration steps and/or by convergence condition

<latex> (8)\;\; ||\Delta \mathbf{p}||\leq \epsilon. </latex>

Your task

Implement KLT tracking algorithm, estimation of Harris point translations, and test it on the familiar sequence with promotional leaflet

If you have not finished the previous task, implement function track_init.m for Harris point detection in the image
Implement function getPatchSubpixel.m for sub-pixel selection in image
Implement KLT algorithm into function track_klt.m
Try an algorithm on the sequence from the first part of this task. Transform the selection and find homographies between frames in the same way as before. Integrate the whole process into cv09.m

KLT algorithm

For the KLT algorithm, it is necessary to implement all operations in a sub-pixel precision. Think about it. Try to find a simple example, in which the non-subpixel argorithm is not enough.

You will need a function for a patch selection patch = getPatchSubpix(img,x,y,win_x,win_y), where the patch is a selection from image img around center x,y with size win_x*2+1 x win_y*2+1. Assume that win_x,win_y are integers, but x,y are real. Function interp2.m can be useful in your implementation. Tip: You will get much faster computation, if you crop the image before using interp2.m.

The KLT algorithm can be summarized in a few steps:

For template <latex> T(\mathbf{x}) </latex> the neighborhood of Harris point <latex> \mathbf{x} </latex> in previous image xPrev, set <latex> \mathbf{p} = [0 ; 0] </latex> and iterate:

Take patch <latex> I ( \mathbf{W}(\mathbf{x;p})) </latex> from new image imgNew the neighborhood of Harris point <latex> \mathbf{x} </latex> with current shift <latex> \mathbf{p} </latex>
Estimate the error <latex> E = I(\mathbf{W}(\mathbf{x;p})) - T(\mathbf{x}) </latex>.
Computes gradients <latex> \nabla I </latex> at translated coordinates <latex> \mathbf{W}(\mathbf{x;p}) </latex>.
Approximate Hessian matrix <latex> H </latex> with dot product <latex> H = \sum_{\mathbf{x}}[\nabla I]^T[\nabla I]</latex>.
Estimate displacement <latex> \Delta \mathbf{p}= H^{-1}\sum_{\mathbf{x}}[\nabla I]^T [T(\mathbf{x})-I(\mathbf{W}(\mathbf{x;p}))] </latex>
Update the translation estimate by <latex> \mathbf{p} \leftarrow \mathbf{p} + \Delta \mathbf{p} </latex>
Test the convergence <latex> ||\Delta \mathbf{p}||\leq \epsilon. </latex>

Set the new position of Harris point <latex> \mathbf{x} \leftarrow \mathbf{x} + \mathbf{p} </latex>

If the algorithm did not converge in the maximum number of steps, the best decision would be to discard the point from further tracking, because the real shift was not found likely.

%% Implementation - example
[Gx,Gy]= ... ;        % gradient estimation
g = [ Gx(:) Gy(:) ];
H = g'*g;             % approximation of hessian matrix with dot product

A simplified illustrative scheme of the algorithm demonstrated on a car tracking is on the figure below

Implement function xNew = track_klt(imgPrev,imgNew,xPrev,options), where parameters are known from the previous task and the structure contains:

 options.klt_window         % size of patch W(x,p) in the sense of getPatchSubpix.m
 options.klt_stop_count     % maximal number of iteration steps
 options.klt_stop_treshold  % minimal change epsilon^2 for termination (squared for easier computation of the distance)
 options.klt_show_steps     % 0/1 turning on and of drawing during tracking

Tip: Do not ignore warnings from Matlab and use operator \ rather than function inv()

To easily check your implementation, include a drawing function after each iteration.

 if (options.klt_show_steps)
  showKltStep(step,T,I,E,Gx,Gy,aP);
 end
 % step  - serial number of iteration (zero based)
 % T     - template, (patch from imgPrev)
 % I     - current sifted patch in imgNew
 % E     - current error (I - T)
 % Gx,Gy - gradients
 % aP    - size of current shift delta P

What you should upload?

Submit file cv09.m together with structure setting options of your implementation into the upload system. Include also completed functions track_init.m, track_klt.m and getPatchSubpix.m together with all used non-standard functions you have created. Submit the generated video file export_billa_xvid.avi with blurred selection and a highlighted bouding box and generated file homography.mat with all homographies.

Testing

To test and for implementation download showkltstep.m for visualization of KLT iteration.

To test your code, you can use a matlab script and a function 'publish'. Copy video-lab1_TASK.zip, unpack and add your files, which are requested for submission. Compare your results with ours.

References

Lucas-Kanade 20 Years On: A Unifying Framework

Predator: A Smart Camera that Learns

Table of Contents

Object tracking

Tracking I, tracking by finding correspondences, correlation.

Your task

Tracking using correspondences

Tracking using the correlation

Homography estimation

What you should upload?

Testing

Where to go next

2. Kanade-Lucas-Tomasi Tracking (KLT tracker)

Your task

KLT algorithm

What you should upload?

Testing

References