
An object to be tracked is usually selected by a rectangular bounding box. The task of a tracker is to follow the object through the video by updating the bounding-box parameters (in the simplest case, just the position).

The basic idea of correlation filter tracking is to estimate an optimal image filter such that filtering the input image produces a desired response. The desired response is typically of a Gaussian shape centered at the target location, so the score decreases with the distance from the target.
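For illustration, such a response can be constructed as a 2-D Gaussian peaked at the target location. A minimal NumPy sketch (the patch size is hypothetical; the bandwidth follows the max(H, W)/10 rule used for `S.y` in this lab):

```python
import numpy as np

H, W = 64, 48                    # patch size (hypothetical)
sigma_y = max(H, W) / 10.0       # response bandwidth, as for S.y below

# squared distance of every pixel from the patch center
ys, xs = np.mgrid[0:H, 0:W]
d2 = (ys - H // 2) ** 2 + (xs - W // 2) ** 2

# Gaussian-shaped desired response: 1 at the target, decaying with distance
y = np.exp(-d2 / (2 * sigma_y ** 2))

print(y.max(), np.unravel_index(y.argmax(), y.shape))
```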

The filter is trained from translated (cyclically shifted) instances of the target patch. At test time, the filter response is evaluated and its maximum gives the new position of the target. The filter is trained on-line and updated with every frame so that the tracker adapts to moderate target changes.

A major advantage of the correlation filter tracker is its computational efficiency: the computation can be performed in the Fourier domain, so the tracker runs faster than real time, at several hundred FPS.

There are both linear and non-linear (kernel) versions of the tracker, both derived from a unifying least-squares principle by Henriques et al. [1].

The optimal linear filter $w$ is found by solving the regularized least squares problem

\begin{equation} \min_w \Bigl( || X w - y ||^2 + \lambda ||w||^2 \Bigr), \end{equation}

where $X$ is a circulant matrix built from the target image patch: the rows of $X$ store all possible cyclic shifts of the patch, $y$ is the desired response, and $\lambda$ is the weight of the regularizer.

The solution of (1) is \begin{equation} w = (X^T X + \lambda I)^{-1} X^T y \end{equation}

Since $X$ is a circulant matrix, $w$ in (2) can be computed quickly using Fourier-domain operations \begin{equation} \hat w = \frac{\hat x \odot \hat y}{\hat x^* \odot \hat x + \lambda}, \end{equation} where $\hat{\cdot}$ denotes the Fourier transform, $\odot$ component-wise multiplication, and $^*$ the complex conjugate. Moreover, the response $r$ of the filter on a test image $z$ is not computed by a sliding window, but again more efficiently as \begin{equation} \hat r = \hat z^* \odot \hat w. \end{equation}
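These identities are easy to verify numerically. A self-contained NumPy sketch (1-D signals for brevity; the same holds dimension-wise in 2-D) that checks the direct solve (2) against the Fourier-domain solution (3), and the shift-by-shift response against (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 1e-4
x = rng.standard_normal(n)   # base signal (1-D stand-in for the image patch)
y = rng.standard_normal(n)   # desired response

# X: row i is the cyclic shift of x by i -> circulant data matrix of eq. (1)
X = np.stack([np.roll(x, i) for i in range(n)])

# direct ridge-regression solution, eq. (2): O(n^3)
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier-domain solution, eq. (3): O(n log n)
xf, yf = np.fft.fft(x), np.fft.fft(y)
w_fft = np.real(np.fft.ifft(xf * yf / (np.conj(xf) * xf + lam)))

# fast filter response on a test signal z, eq. (4)
z = rng.standard_normal(n)
r_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(z)) * np.fft.fft(w_fft)))
r_direct = np.stack([np.roll(z, i) for i in range(n)]) @ w_fft

print(np.allclose(w_direct, w_fft), np.allclose(r_direct, r_fft))
```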

By using the “kernel trick”, i.e. mapping the input data by a non-linear function $x \rightarrow \varphi(x)$ and expressing the solution as the linear combination $w = \sum_i \alpha_i \varphi(x_i)$, eq. (1) becomes \begin{equation} \min_{\alpha} \Bigl( || K \alpha - y ||^2 + \lambda \alpha^T K \alpha \Bigr), \end{equation} where $K$ is the kernel matrix with elements given by the dot products $k_{i,j} = \varphi(x_i)^T \varphi(x_j)$. The original problem thus becomes non-linear, a.k.a. kernelized ridge regression.

The solution of problem (5) is \begin{equation} \alpha = (K + \lambda I)^{-1} y, \end{equation}

which can be computed efficiently by \begin{equation} \hat \alpha = \frac{\hat y}{\hat k^{xx} + \lambda}, \end{equation} together with the fast detection response \begin{equation} \hat r = \hat k^{xz} \odot \hat \alpha. \end{equation}

The above equations are not valid for all kernels, but they are for, e.g., the RBF Gaussian kernel $k(x, x') = \exp(-\frac{1}{\sigma^2}||x-x'||^2)$, which leads to \begin{equation} k^{xx'} = \exp\Bigl(-\frac{1}{\sigma^2}\bigl(||x||^2 + ||x'||^2 - 2{\cal{F}}^{-1}(\hat x^\ast \odot \hat x')\bigr) \Bigr). \end{equation}
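The formula can be checked directly: the single inverse FFT in (9) yields the Gaussian kernel value between $x$ and every cyclic shift of $x'$ at once. A 1-D NumPy sketch with hypothetical values (note that the lab's MATLAB code additionally normalizes the squared distance by the number of elements, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 8, 0.5
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

# all-shifts Gaussian kernel values via the FFT, eq. (9)
c = np.real(np.fft.ifft(np.conj(np.fft.fft(x1)) * np.fft.fft(x2)))
k_fft = np.exp(-(x1 @ x1 + x2 @ x2 - 2 * c) / sigma**2)

# naive evaluation: one kernel value per cyclic shift of x2
k_direct = np.array([np.exp(-np.sum((x1 - np.roll(x2, -i))**2) / sigma**2)
                     for i in range(n)])

print(np.allclose(k_fft, k_direct))
```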

Note that for the linear kernel $k(x, x') = x^T x'$, this leads to \begin{equation} k^{xx'} = {\cal{F}}^{-1}(\hat x^* \odot \hat x'). \end{equation} Substituting (10) into (7) and (8) then recovers the linear correlation tracker equations (3) and (4), respectively.
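As a further sanity check: for training samples that are cyclic shifts of one patch and a shift-invariant kernel, the kernel matrix $K$ is circulant, so the direct solve in (6) must agree with the Fourier-domain formula (7). A 1-D NumPy sketch with the Gaussian kernel (hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, sigma = 8, 1e-4, 0.5
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2) / sigma**2)

# kernel matrix over all pairs of cyclic shifts -> circulant matrix
K = np.array([[rbf(np.roll(x, i), np.roll(x, j)) for j in range(n)]
              for i in range(n)])

# direct dual solution, eq. (6)
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier-domain solution, eq. (7): k^{xx} is the first column of K
kxx = K[:, 0]
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(kxx) + lam)))

print(np.allclose(alpha_direct, alpha_fft))
```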

Besides its efficiency, the tracker is also remarkably simple to implement. The code of the tracker core follows:

```matlab
% Input:
%   x - training image patch, H x W x C (height, width, channels)
%   y - regression target (Gaussian shape), H x W
%   z - test image patch, H x W x C
%
% Output:
%   responses - detection score for each location, H x W
%
% Parameters:
%   lambda - regularization weight (default: 1e-4)
%   sigma  - Gaussian kernel bandwidth (default: 0.2)

function alphaf = kcf_train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function responses = kcf_detect(alphaf, x, z, sigma)
  k = kernel_correlation(x, z, sigma);
  responses = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));  % Gaussian kernel
  %k = c;                                     % linear kernel
end
```

For the first frame, the model is trained at the initial position of the target object. The patch is larger than the tracked object to provide some context, and the input patch is weighted by a cosine window to mitigate the boundary artifacts of the circular shifts.
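The cosine window is commonly built as the outer product of two 1-D Hann windows; it decays smoothly to zero at the patch border, suppressing the wrap-around discontinuities implied by the cyclic-shift model. A NumPy sketch (patch size hypothetical):

```python
import numpy as np

H, W = 6, 5  # patch size (hypothetical)

# 2-D cosine (Hann) window: outer product of two 1-D Hann windows
window = np.outer(np.hanning(H), np.hanning(W))

# the window multiplies the input patch element-wise before training/testing
print(window[0, :], window[H // 2, W // 2])  # zero border row, large near center
```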

For a new frame, a test image patch is extracted at the current bounding-box position. The target is detected as the maximum-score location over the test patch and the bounding box is updated. A new model is trained at the new position. Finally, to provide the tracker with some memory, both `alphaf` and `x` are updated by a convex combination of the current and the previous states:

\begin{equation} \hat \alpha^{(t)} = \gamma \hat \alpha + (1-\gamma) \hat \alpha^{(t-1)}, \\ x^{(t)} = \gamma x + (1-\gamma) x^{(t-1)}, \end{equation} where parameter $\gamma \in [0,1]$ controls the tracker adaptation. Default $\gamma$ is 0.075.
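The update of eq. (11) is a per-element running (exponentially weighted) average; a tiny NumPy sketch with made-up values:

```python
import numpy as np

gamma = 0.075                            # adaptation rate, eq. (11)

alphaf_prev = np.full(4, 2.0 + 0.0j)     # previous model (made-up values)
alphaf_cur  = np.full(4, 4.0 + 0.0j)     # model trained on the current frame
x_prev, x_cur = np.zeros(4), np.ones(4)

# convex combination: small gamma = long memory, slow adaptation
alphaf = gamma * alphaf_cur + (1 - gamma) * alphaf_prev
x      = gamma * x_cur      + (1 - gamma) * x_prev

print(alphaf[0].real, x[0])
```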

Your task will be to track an object in the Billa-leaflet video billa.avi. You will implement the KCF tracker and generate a video with the tracked bounding box superimposed on every frame.

The video is processed by the following algorithm:

- Open the input video
- Set the target object bounding box → `bbox0`
- While not EOF
  - Read video frame → `I`
  - IF the 1st frame
    - Init the tracker state: `[S] = track_init_kcf(I, bbox0, opts)`
  - ELSE
    - Update the tracker state: `[S] = track_kcf(S, I)`
  - END
  - Visualize: `show_kcf_tracker_state(S, I)`
  - Add the result into the output video
  - Store the result in the output structure → `out`
- END

Tracker state structure `S` contains:

```matlab
S.bbox        %(4x1)   current bounding box [top_left_x, top_left_y, bot_right_x, bot_right_y]
S.lambda      %1e-4    regularization constant
S.sigma       %0.2     RBF Gaussian kernel bandwidth
S.gamma       %0.075   adaptation rate, see Eq. (11)
S.kernel_type %{rbf|linear}
S.window      %(WxHxC) cosine input window
S.x           %(WxHxC) training image patch
S.y           %(WxH)   target response, Gaussian shape with sigma = max(W,H)/10
S.z           %(WxHxC) test image patch
S.alphaf      %(WxH)   current tracker parameters
S.resp        %(WxH)   current response score image
S.resp_max    %(1x1)   maximum score
```

See `test_track_video.m` for details. Templates for both `track_init_kcf.m` and `track_kcf.m`, together with all necessary functions and data, are found here: ZIP.

You are supposed to upload the functions `track_init_kcf.m` and `track_kcf.m`, the output video you have generated, and all non-standard functions you have created.

To test your code, run `test_publish.m`, which calls `test.m`, the main test script, and generates an HTML page. Compare your results with ours; your output video should be similar to our video.

- [1] Henriques J. F., Caseiro R., Martins P., Batista J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. on PAMI 37(3):583–596, 2015.
