====== Object tracking ======

Video tracking is the process of estimating the position of an object over time using a camera. A tracking algorithm analyzes video frames and estimates the position of the object. Simple models used for tracking are, for instance:

  * A planar object with a motion model of a 2D image transformation (affine or homography).
  * A rigid 3D object with a motion model depending on its 3D position and orientation.
  * An object with feature points (e.g. Harris points) or regions (e.g. MSERs) in the image, which are tracked separately. This provides robustness to occlusions.
  * A non-rigid (deformable) object approximated by a surface wireframe. The motion of the object is defined by the positions of the wireframe edges.

Popular tracking algorithms are: correlation, mean-shift, Kanade-Lucas-Tomasi (KLT) tracking, Kalman tracking, particle filtering, (sequential) linear predictors and so on. In this course you will familiarize yourself with tracking using correlation and with the KLT tracker.

====== Tracking I: tracking by finding correspondences, correlation ======

All files needed for this lab can be found in ''/opt/cv/tracking'' on your virtual machine.

We already know how to find correspondences between images. The task of tracking is very similar: it searches for correspondences of the object (an image frame, a region, or a feature point with its neighborhood) between a given frame and the subsequent, unknown frames. The difference from correspondence finding between two wide-baseline images is that we can now expect only small changes of the object (position, deformation, photometric) between successive frames. Nevertheless, the changes of the object's appearance over the whole sequence can be significant. Selection of the object for tracking is done manually or by automatic detection (which is outside the scope of this lab).

You can choose to implement one of these two methods:
  * tracking by correspondence finding, using methods from the [[../3_1_cspond/start | third retrieval task]]
  * tracking by correlation in the neighborhoods of Harris points

Since we will be working with video in Matlab, download function ''processMpvVideo(filename,method,options)'', where ''filename'' is the name of the video file (e.g. ''example.avi''), ''method'' is a tracking method and ''options'' is a structure with parameters for the tracking function. The function creates a video sequence with the tracked points plotted, writes the output into folder ''./export/'' and shows a preview meanwhile. However, to avoid codec problems, the video is converted to a sequence of images; ''processMpvVideo'' reads them and saves the output sequence of images into folder ''./export/''. To join the images into the video, which you have to submit, use:
  * Linux: ''convert export/*.jpg export/%05d.JPG; ffmpeg -r 25 -i export/%05d.JPG export/vid.mp4''

You will implement all tracking methods as functions ''xNew = track_//method//(imgPrev,imgNew,xPrev,options)'', where ''imgPrev'' is the previous frame of the sequence, ''xPrev'' are the tracked points in the previous frame, ''imgNew'' is the new image of the sequence and ''xNew'' are the tracked points found in the new frame. Tracked points are represented by a structure:

  x.x    % COLUMN vector of x coordinates
  x.y    % COLUMN vector of y coordinates
  x.ID   % COLUMN vector of unique point identifiers (an index into the array is not enough, because some points disappear during tracking)
  x.data % information specific to the tracking method (e.g. a SIFT description)
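For illustration, a minimal sketch of building this structure from detected points; the variables ''px'', ''py'' are placeholders for the coordinates returned by your detector:

<code matlab>
% Minimal sketch (assumed variable names): build the tracked-point
% structure from detected point coordinates px, py.
N = numel(px);
x.x    = px(:);     % force COLUMN vectors
x.y    = py(:);
x.ID   = (1:N)';    % unique IDs that stay with a point while it is tracked
x.data = [];        % filled by the chosen method (e.g. descriptors)
</code>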
As mentioned before, a tracking algorithm tracks the selected object through the entire sequence. Selection of the object for tracking is done in two phases. First, ''processMpvVideo'' calls function ''data = processMpvInit(img,options)'', which selects the object for tracking in the first image ''img'' of the sequence. In our case the object is represented by a bounding box:

  data.xRect % x coordinates of bounding box corners (anticlockwise)
  data.yRect % y coordinates of bounding box corners (anticlockwise)

The second function, ''x = track_init(img,options)'', finds the points ''x'' for tracking; ''options'' is a structure with parameters. Finally, for each image the function ''processMpvFrame(data,imgPrev,imgNew,xPrev,xNew,options)'' is called, which uses the found tracking points, for instance for estimating a homography between frames.

Set the structure ''options'' and call function ''processMpvVideo.m'' in a "main" script ''cv08_//method//.m'', which you will also submit.
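A minimal sketch of such a main script is below; the concrete field values and the method identifier are only illustrative, use whatever your functions actually need:

<code matlab>
% cv08_corr.m -- minimal sketch of the "main" script (illustrative values)
options = struct();
options.sigma_d         = 1;     % Harris: derivation scale
options.sigma_i         = 2;     % Harris: integration scale
options.thresh          = 1e-5;  % Harris: response threshold
options.ps              = 10;    % size of the point neighborhood (patch)
options.corr_max_dist   = 30;    % max distance of corresponding points [px]
options.rnsc_threshold  = 3;     % RANSAC inlier threshold [px]
options.rnsc_confidence = 0.99;  % RANSAC confidence

processMpvVideo('billa_xvid.avi', 'corr', options);
</code>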
To ease your task and for a better understanding of the parameters, we prepared templates of the described functions, which simulate a constant shift of randomly selected points: {{http://cmp.felk.cvut.cz/~mishkdmy/video-lab1_TASK.zip|All-in-one}}

{{:courses:ae4m33mpv:labs:4_tracking:struktura_programu_en.png|}}

===== Your task =====

Your task is to hide the identity of a selected object in a video sequence. You will select the object in the first frame (for instance your least favorite vegetable) and then blur this object in the whole sequence (similarly to what you know from TV). It can be done in two ways:
  * direct tracking of the selected object
  * assuming the object is not moving, tracking the whole scene and blurring the right place

If we know, or can assume, that there is only one scene in the sequence and that it is planar (e.g. images from an airplane), we can use the second method - tracking the scene and estimating a homography, in the same way as in the [[../03_cspond/start?#the_model_of_a_planar_scene| third retrieval task]] or the course [[http://cmp.felk.cvut.cz/cmp/courses/TZ/2010/HomeWorks/HW-06/|A4M33TZ]]. This makes the tracking more robust. In the obligatory part of this task we will track the whole scene and estimate the inter-frame homographies. Optionally, you can compare this method with direct object tracking.

The goal of this task is to get familiar with and fill in the universal framework for tracking, and to implement a simple tracking algorithm based on your previous knowledge (either correlations or correspondences). Use the successfully tracked points in the scene to estimate homographies between images and to carry the rectangle across the sequence.

  - Familiarize yourself with the tracking framework and the testing sequence:
    * xvid {{courses:a4m33mpv:cviceni:4_sledovani_objektu:billa_xvid.avi|}}
    * jpeg {{courses:a4m33mpv:cviceni:4_sledovani_objektu:billa_jpg.zip|}}
    * MPEG4 {{courses:a4m33mpv:cviceni:4_sledovani_objektu:billa_mpeg4.avi|}}
    * WMV2 {{courses:a4m33mpv:cviceni:4_sledovani_objektu:billa_wmv2.avi|}}
  - Choose an implementation: correspondences or correlations.
  - Implement detection of tracking points in the whole image in function ''track_init_//method//.m''
  - Implement tracking of the points of interest in function ''track_//method//.m''
  - Try your algorithm on the testing video and draw the tracked points
  - Estimate homographies between images from the tracked points
  - Transform the bounding box of the selected object
  - Blur the image inside the bounding box
  - Join steps 6-8 into function ''processmpvframe.m''
  - Generate the video with the blurred object

Example of bounding box transformation (without blurring): {{courses:a4m33mpv:cviceni:4_sledovani_objektu:export_billa_xvid.avi|}}

===== Tracking using correspondences =====

Based on your knowledge and the code from the second task, you should be able to implement a function that finds corresponding points between successive frames of the sequence and estimates the homography between the frames from the tentative correspondences. Only a small portion of new code is needed. A suitable combination for a first experiment is Harris points with DCT descriptors. If your algorithms from the second task do not give good results, you can add the constraint that the distance between corresponding points in two frames cannot be too large (due to the slow motion, the scene cannot change much between successive frames). The method you choose is up to you. The minimal acceptable solution for this task is your two functions ''track_init_correspond.m'' and ''track_correspond.m'' that track and transform the selection through the whole sequence.

===== Tracking using the correlation =====

Tracking using the correlation finds correspondences by directly comparing image intensities between feature point neighborhoods, instead of constructing a compact normalized descriptor (SIFT, DCT, etc.). We will use the correlation coefficient as the similarity statistic. The correlation coefficient (also called normalized cross-correlation, NCC) is defined as

$ NCC(T, I) = \frac{\sum_{k}\sum_{l}\left(T(k,l) - \overline{T}\right)\left(I(x+k,y+l)-\overline{I(x,y)}\right)} {\sqrt{\sum_{k}\sum_{l}\left(T(k,l) - \overline{T}\right)^2} \sqrt{\sum_{k}\sum_{l}\left(I(x+k,y+l) - \overline{I(x,y)}\right)^2}} $

where $T$ is the template window and $I(x,y)$ is the target image window. The NCC value lies in the range <-1,+1> and is commonly used as a similarity statistic between image windows, see the Matlab function ''corr2''. /*It is possible to compute NCC in a [[http://www.idiom.com/%7Ezilla/Papers/nvisionInterface/nip.html|more effective way]]. Such an implementation will be rewarded with one bonus point.*/

  * Advantage of NCC: invariance to affine transformations of intensity.
  * Disadvantages of NCC: not invariant to scale changes, rotation and perspective distortion; for some applications the computational complexity is a problem.

Harris corner points are often suitable for tracking. Using function ''harris(img,sigma_d,sigma_i,thresh)'', write a function ''x = track_init(img,options)'' which returns Harris points ''x'' in a chosen region of interest ''options.ROI''. The structure ''options'' will contain fields ''sigma_d'', ''sigma_i'' and ''thresh''. As mentioned above, the method is analogous to correspondence finding - detect Harris points in each frame (''track_init'' will also be called inside ''track_corr''!) and find correspondences for them using the NCC. Set the size of the neighborhood with parameter ''options.ps'' and compare the results for different settings.
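For reference, the NCC of two equally sized grayscale patches can be computed directly; a minimal sketch, equivalent to Matlab's ''corr2(T,I)'' (a constant patch yields NaN and should be guarded against in practice):

<code matlab>
% NCC between two equally sized grayscale patches T and I (a sketch,
% equivalent to corr2(T,I))
T  = double(T);  I = double(I);
Tc = T - mean(T(:));     % zero-mean template
Ic = I - mean(I(:));     % zero-mean image window
ncc = (Tc(:)' * Ic(:)) / (norm(Tc(:)) * norm(Ic(:)));  % value in <-1,+1>
</code>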
Assuming only small shifts between successive frames, it is not necessary (nor desirable) to compute the correlation between all pairs of points. For each point from the first image, compute correlations only with points in the second image up to some distance - obviously, too distant points cannot correspond. Set this threshold with parameter ''options.corr_max_dist''. The value depends on the character of the motion in the actual sequence; for our task we recommend 30 px.

To identify promising matches, i.e. correspondences of points with a high correlation coefficient, use for instance the principle of mutually nearest correspondences: two points are paired if the second point is the best match (highest NCC) among the points in the second image and, simultaneously, the first point is the best match among the points in the first image. If there is no point in the new frame that corresponds to a tracked point from the previous frame, this tracked point is discarded and not tracked anymore. Join this functionality into ''xNew = track_corr(imgPrev,imgNew,xPrev,options)''.

===== Homography estimation =====

If the potentially corresponding points between frames (tentative correspondences) are known and the scene is (approximately) planar, the homography between the frames can be estimated. We will use our function ''[Hbest,inl]=ransac_h(u,threshold,confidence)'' from the second task. Create the vector of points ''u''; the ''ID''s are known, and you also know that ''xPrev'' contains all points from ''xNew''. Keep the settings in fields ''options.rnsc_threshold'' and ''options.rnsc_confidence''. Discard the points which are homography outliers from further tracking. Advanced algorithms have methods for adding new tracking points; however, we will not implement any in this course.

Knowing the homography (matrix ''H''), you can transform ($x_{n} = H_{best}\,x_{p}$) the corners of the bounding box which outlines the selected object from one frame to the next. Blur the interior region with function ''gaussfilter.m'' with a high enough sigma. Do not blur the image outside of the selection (you can use your knowledge from this [[http://cmp.felk.cvut.cz/cmp/courses/TZ/2010/HomeWorks/HW-06/|course]]).

Join your functions into ''[dataOut, xNewOut] = processMpvFrame(data,imgPrev,imgNew,xPrev,xNew,options)''. This function returns structure ''dataOut'' containing

  dataOut.xRect % transformed x coordinates of bounding box corners (anticlockwise)
  dataOut.yRect % transformed y coordinates of bounding box corners (anticlockwise)
  dataOut.H     % estimated homography

and ''xNewOut'' containing the tracked points in the current frame without outliers. Within this function, also draw the current image frame with the blurred selection into a figure and show the transformed bounding box.
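A minimal sketch of the corner transformation and the blurring step; building the mask with ''roipoly'' is one possible choice, a grayscale image is assumed, and the exact signature of the provided ''gaussfilter.m'' may differ:

<code matlab>
% Sketch: transform bounding-box corners with homography Hbest, blur inside
u = [data.xRect(:)'; data.yRect(:)'; ones(1,4)];   % homogeneous corners x_p
v = Hbest * u;                                     % x_n = Hbest * x_p
dataOut.xRect = (v(1,:) ./ v(3,:))';               % dehomogenize
dataOut.yRect = (v(2,:) ./ v(3,:))';
dataOut.H     = Hbest;

mask = roipoly(imgNew, dataOut.xRect, dataOut.yRect); % pixels inside the box
blurred = gaussfilter(imgNew, 10);                    % high enough sigma
imgNew(mask) = blurred(mask);                         % blur only the selection
</code>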
===== What should you upload? =====

Submit into the upload system: the file ''cv08.m'' with the structure ''options'' for your implementation, your function ''processmpvframe.m'' and either (''track_init_correspond.m'' and ''track_correspond.m'') **or** (''track_init.m'' and ''track_corr.m''), together with all non-standard functions you have created and used. Submit also the generated video file ''export_billa_xvid.avi'' with the blurred selection and the generated file ''homography.mat'' with all homographies.

The minimal acceptable solution implements one of the two simple tracking methods in such a way that it generates enough points and, without adding new points, finds homographies for the entire sequence (the acceptable minimum is 20 points in the scene).

Please **do not** submit the function ''processMpvVideo.m''. It complicates the automatic evaluation.

===== Testing =====

To test your code, you can use the Matlab script ''test.m'' and the function ''publish''. Compare your results with [[http://cmp.felk.cvut.cz/~cechj/mpv/track-08/Vysledky%20ulohy%20%2708_track%27%20z%20testovaciho%20skriptu.htm|ours]]. As a quality measure of your algorithm, you can use the last part of the test, where you track points on an image transformed by a known homography - in the ideal case you get the same homography back. Other comparisons are against the KLT tracker from [[http://sourceforge.net/projects/opencvlibrary/|OpenCV]].

===== Where to go next =====

Tracking using correspondences has several disadvantages:
  * it requires feature detection (Harris points, ...) in the next image
  * its precision is limited by the precision of the detected points

In the next task we will show how to search directly for the selection (or its transformation) in the next image. The object will be selected only in the first image.

====== Tracking II: Kanade-Lucas-Tomasi tracking (KLT tracker) ======

KLT minimizes the sum of squared differences of image intensities between windows in subsequent frames. The minimum is found iteratively by the Newton-Raphson method. We are given a patch template $T(\mathbf{x})$ centered at pixel $\mathbf{x} = [x,y]^T$ in the image frame at time $t$. In a subsequent frame, at time $t+1$, the target moves to a new position described by the coordinate transformation $\mathbf{W}(\mathbf{x};\mathbf{p})=[x+p_x, y+p_y]^T$. The task is to estimate the displacement $\mathbf{p} = [p_x, p_y]^T$ minimizing

$(1)\;\;\sum_{\mathbf{x}}\left(I(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x})\right)^2.$

Considering a further shift $\Delta\mathbf{p}$ of the current estimate $\mathbf{p}$, from (1) we get

$(2)\;\;\sum_{\mathbf{x}}\left(I(\mathbf{W}(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})) - T(\mathbf{x})\right)^2.$

We minimize this expression with respect to $\Delta\mathbf{p}$. The nonlinear expression (2) is linearized by a (first order) Taylor expansion

$(3)\;\;\sum_{\mathbf{x}}\left(I(\mathbf{W}(\mathbf{x};\mathbf{p})) + \nabla I\,\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\,\Delta\mathbf{p} - T(\mathbf{x})\right)^2,$

where $\nabla I = \left[\frac{\partial I}{\partial x},\frac{\partial I}{\partial y}\right]$ is the image gradient evaluated at $\mathbf{W}(\mathbf{x};\mathbf{p})$. The term $\frac{\partial \mathbf{W}}{\partial \mathbf{p}}$ is the Jacobian matrix of the coordinate transformation

$(4)\;\;\frac{\partial \mathbf{W}}{\partial \mathbf{p}} = \frac{\partial [x+p_x,\,y+p_y]^T}{\partial [p_x, p_y]^T} = \left[\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\right].$

The minimum of expression (3) over $\Delta\mathbf{p}$ is

$(5)\;\;\Delta\mathbf{p} = H^{-1}\sum_{\mathbf{x}}\left[\nabla I\,\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right]^T \left[T(\mathbf{x}) - I(\mathbf{W}(\mathbf{x};\mathbf{p}))\right],$

where $H$ is the approximation of the Hessian matrix used in the Gauss-Newton gradient method; in fact, this is nonlinear regression (nonlinear least squares). Note that the approximation of the Hessian matrix in this case equals the autocorrelation matrix (Harris matrix), i.e. a dot product of the first partial derivatives. This suggests that it is a good idea to track points in a close neighborhood of Harris points.

$(6)\;\;H = \sum_{\mathbf{x}}\left[\nabla I\,\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right]^T\left[\nabla I\,\frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right]$

Substituting (4) into (5) and (6) gives the simplification $\nabla I\,\frac{\partial \mathbf{W}}{\partial \mathbf{p}} = \nabla I$. The displacement correction $\Delta\mathbf{p}$ is computed in each iteration and the estimated shift is updated by

$(7)\;\;\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}.$

The iterations are terminated by a maximum number of iteration steps and/or by the convergence condition

$(8)\;\;||\Delta\mathbf{p}||\leq \epsilon.$
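Written as code, one Gauss-Newton update from equations (5)-(7) takes only a few lines; a minimal sketch, assuming the patches ''T'', ''I'' and the gradients ''Gx'', ''Gy'' have already been extracted at the current position:

<code matlab>
% One KLT (Gauss-Newton) update -- a sketch; T, I are equally sized patches,
% Gx, Gy are gradients of the new image at the shifted position
g  = [Gx(:) Gy(:)];   % one row [dI/dx dI/dy] per pixel
H  = g' * g;          % 2x2 Hessian approximation, eq. (6)
e  = T(:) - I(:);     % error image T(x) - I(W(x;p))
dp = H \ (g' * e);    % eq. (5); prefer \ over inv()
p  = p + dp;          % eq. (7)
</code>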
===== Your task =====

Implement the KLT tracking algorithm, i.e. the estimation of Harris point translations, and test it on the familiar sequence with the promotional leaflet:
  - If you have not finished the previous task, implement function ''track_init.m'' for Harris point detection in the image
  - Implement function ''getPatchSubpix.m'' for sub-pixel patch selection in an image
  - Implement the KLT algorithm in function ''track_klt.m''
  - Try the algorithm on the sequence from the first part of this task. Transform the selection and find the homographies between frames in the same way as before. Integrate the whole process into ''cv09.m''

===== KLT algorithm =====

For the KLT algorithm it is necessary to implement all operations with sub-pixel precision. Think about why; try to find a simple example in which a non-subpixel algorithm is not sufficient.

You will need a function for patch selection, ''patch = getPatchSubpix(img,x,y,win_x,win_y)'', where ''patch'' is a selection from image ''img'' around the center ''x,y'' with size ''(win_x*2+1) x (win_y*2+1)''. Assume that ''win_x,win_y'' are integers, but ''x,y'' are real. Function ''interp2.m'' can be useful in your implementation. **Tip:** You will get a much faster computation if you crop the image before calling ''interp2.m''.
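A possible sketch of ''getPatchSubpix'' built on ''interp2''; cropping an integer region first implements the speed-up from the tip (border handling is omitted, the patch is assumed to lie inside the image):

<code matlab>
function patch = getPatchSubpix(img, x, y, win_x, win_y)
% Sketch: bilinear sub-pixel patch of size (2*win_y+1) x (2*win_x+1)
% around the real-valued center (x,y); assumes the patch lies inside img.
img = double(img);
x0  = floor(x) - win_x - 1;          % top-left corner of an integer crop
y0  = floor(y) - win_y - 1;
crop = img(y0 : y0 + 2*win_y + 3, x0 : x0 + 2*win_x + 3);
[X, Y] = meshgrid((x - win_x : x + win_x) - x0 + 1, ...
                  (y - win_y : y + win_y) - y0 + 1);
patch = interp2(crop, X, Y);         % bilinear interpolation by default
end
</code>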
The KLT algorithm can be summarized in a few steps. For a template $T(\mathbf{x})$, the neighborhood of a Harris point $\mathbf{x}$ (from ''xPrev'') in the previous image ''imgPrev'', set $\mathbf{p} = [0, 0]^T$ and iterate:
  - Take the patch $I(\mathbf{W}(\mathbf{x};\mathbf{p}))$ from the new image ''imgNew'', the neighborhood of Harris point $\mathbf{x}$ with the current shift $\mathbf{p}$.
  - Estimate the error $E = I(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x})$.
  - Compute the gradients $\nabla I$ at the translated coordinates $\mathbf{W}(\mathbf{x};\mathbf{p})$.
  - Approximate the Hessian matrix $H$ with the dot product $H = \sum_{\mathbf{x}}[\nabla I]^T[\nabla I]$.
  - Estimate the displacement $\Delta\mathbf{p} = H^{-1}\sum_{\mathbf{x}}[\nabla I]^T [T(\mathbf{x})-I(\mathbf{W}(\mathbf{x};\mathbf{p}))]$.
  - Update the translation estimate by $\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}$.
  - Test the convergence $||\Delta\mathbf{p}||\leq \epsilon$.

Finally, set the new position of the Harris point: $\mathbf{x} \leftarrow \mathbf{x} + \mathbf{p}$. If the algorithm does not converge within the maximum number of steps, the best decision is to discard the point from further tracking, because the real shift was most likely not found.

<code matlab>
%% Implementation - example
[Gx,Gy] = gradient(I); % gradient estimation (one option; derivative-of-Gaussian filtering also works)
g = [ Gx(:) Gy(:) ];
H = g'*g;              % approximation of the Hessian matrix with a dot product
</code>

A simplified illustrative scheme of the algorithm, demonstrated on tracking a car, is in the figure below. \\
{{courses:a4m33mpv:cviceni:4_sledovani_objektu:example.png}} \\

Implement function ''xNew = track_klt(imgPrev,imgNew,xPrev,options)'', where the parameters are known from the previous task and the structure contains:

  options.klt_window        % size of patch W(x,p) in the sense of getPatchSubpix.m
  options.klt_stop_count    % maximal number of iteration steps
  options.klt_stop_treshold % minimal change epsilon^2 for termination (squared for easier computation of the distance)
  options.klt_show_steps    % 0/1, turns drawing during tracking on and off

**Tip:** Do not ignore warnings from Matlab, and use the operator ''\'' rather than the function ''inv()''.

To easily check your implementation, include a drawing function after each iteration:

<code matlab>
if (options.klt_show_steps)
  showKltStep(step,T,I,E,Gx,Gy,aP);
end
% step  - serial number of the iteration (zero based)
% T     - template (patch from imgPrev)
% I     - current shifted patch from imgNew
% E     - current error (I - T)
% Gx,Gy - gradients
% aP    - size of the current shift delta p
</code>

===== What should you upload? =====

Submit the file ''cv09.m'' together with the structure ''options'' of your implementation into the upload system. Include also the completed functions ''track_init.m'', ''track_klt.m'' and ''getPatchSubpix.m'', together with all non-standard functions you have created and used. Submit the generated video file ''export_billa_xvid.avi'' with the blurred selection and a highlighted bounding box, and the generated file ''homography.mat'' with all homographies.

===== Testing =====

For testing and for your implementation, download {{courses:a4m33mpv:cviceni:4_sledovani_objektu:showkltstep.m|}} for the visualization of KLT iterations. To test your code, you can use a Matlab script and the function ''publish''. Copy {{http://cmp.felk.cvut.cz/~mishkdmy/video-lab1_TASK.zip}}, unpack it and add your files which are requested for submission. Compare your results with [[http://cmp.felk.cvut.cz/~cechj/mpv/track-09/Vysledky%20ulohy%20%2709_klt%27%20z%20testovaciho%20skriptu.htm|ours]].

===== References =====

[[http://www.ri.cmu.edu/pub_files/pub3/baker_simon_2002_3/baker_simon_2002_3.pdf|Lucas-Kanade 20 Years On: A Unifying Framework]]

[[http://www.youtube.com/watch?v=1GhNXHCQGsM|Predator: A Smart Camera that Learns]]