
An object to be tracked is usually selected by a rectangular bounding box. The task of a tracker is to follow the object through the video by updating the bounding-box parameters (in the simplest case, just the position).

The basic idea of correlation filter tracking is to estimate an optimal image filter such that filtering the input image produces a desired response. The desired response is typically of a Gaussian shape centered at the target location, so the score decreases with the distance from the target.
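For illustration, such a response can be constructed as a 2-D Gaussian peaked at the target location. A minimal NumPy sketch (the patch size is hypothetical; the bandwidth follows the max(H, W)/10 rule used for `S.y` in this lab):

```python
import numpy as np

H, W = 64, 48                    # patch size (hypothetical)
sigma_y = max(H, W) / 10.0       # response bandwidth, as for S.y below

# squared distance of every pixel from the patch center
ys, xs = np.mgrid[0:H, 0:W]
d2 = (ys - H // 2) ** 2 + (xs - W // 2) ** 2

# Gaussian-shaped desired response: 1 at the target, decaying with distance
y = np.exp(-d2 / (2 * sigma_y ** 2))

print(y.max(), np.unravel_index(y.argmax(), y.shape))
```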

The filter is trained from translated (cyclically shifted) instances of the target patch. At test time, the filter response is evaluated and its maximum gives the new position of the target. The filter is trained on-line and updated with every frame so that the tracker adapts to moderate target changes.

A major advantage of the correlation filter tracker is its computational efficiency: the computation can be performed in the Fourier domain, so the tracker runs faster than real time, at several hundred FPS.

There are both linear and non-linear (kernel) versions of the tracker, both derived from a unifying least-squares principle by Henriques et al. [1].

The optimal linear filter $w$ is found by solving the regularized least squares problem

\begin{equation} \min_w \Bigl( || X w - y ||^2 + \lambda ||w||^2 \Bigr), \end{equation}

where $X$ is a circulant matrix built from the target image patch: the rows of $X$ store all possible cyclic shifts of the patch, $y$ is the desired response, and $\lambda$ is the weight of the regularizer.

The solution of (1) is \begin{equation} w = (X^T X + \lambda I)^{-1} X^T y \end{equation}

Since $X$ is a circulant matrix, $w$ in (2) can be computed quickly using Fourier-domain operations \begin{equation} \hat w = \frac{\hat x \odot \hat y}{\hat x^* \odot \hat x + \lambda}, \end{equation} where $\hat{\cdot}$ denotes the Fourier transform, $\odot$ component-wise multiplication, and $^*$ the complex conjugate. Moreover, the response $r$ of the filter on a test image $z$ is not computed by a sliding window, but again more efficiently as \begin{equation} \hat r = \hat z^* \odot \hat w. \end{equation}
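These identities are easy to verify numerically. A self-contained NumPy sketch (1-D signals for brevity; the same holds dimension-wise in 2-D) that checks the direct solve (2) against the Fourier-domain solution (3), and the shift-by-shift response against (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 1e-4
x = rng.standard_normal(n)   # base signal (1-D stand-in for the image patch)
y = rng.standard_normal(n)   # desired response

# X: row i is the cyclic shift of x by i -> circulant data matrix of eq. (1)
X = np.stack([np.roll(x, i) for i in range(n)])

# direct ridge-regression solution, eq. (2): O(n^3)
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fourier-domain solution, eq. (3): O(n log n)
xf, yf = np.fft.fft(x), np.fft.fft(y)
w_fft = np.real(np.fft.ifft(xf * yf / (np.conj(xf) * xf + lam)))

# fast filter response on a test signal z, eq. (4)
z = rng.standard_normal(n)
r_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(z)) * np.fft.fft(w_fft)))
r_direct = np.stack([np.roll(z, i) for i in range(n)]) @ w_fft

print(np.allclose(w_direct, w_fft), np.allclose(r_direct, r_fft))
```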

By using the “kernel trick”, i.e. mapping the input data by a non-linear function $x \rightarrow \varphi(x)$ and expressing the solution as the linear combination $w = \sum_i \alpha_i \varphi(x_i)$, eq. (1) becomes \begin{equation} \min_{\alpha} \Bigl( || K \alpha - y ||^2 + \lambda \alpha^T K \alpha \Bigr), \end{equation} where $K$ is the kernel matrix with elements given by the dot products $k_{i,j} = \varphi(x_i)^T \varphi(x_j)$. The original problem thus becomes non-linear, a.k.a. kernelized ridge regression.

The solution of problem (5) is \begin{equation} \alpha = (K + \lambda I)^{-1} y, \end{equation}

which can be computed efficiently by \begin{equation} \hat \alpha = \frac{\hat y}{\hat k^{xx} + \lambda}, \end{equation} together with the fast detection response \begin{equation} \hat r = \hat k^{xz} \odot \hat \alpha. \end{equation}

The above equations are not valid for all kernels, but they are for, e.g., the RBF Gaussian kernel $k(x, x') = \exp(-\frac{1}{\sigma^2}||x-x'||^2)$, which leads to \begin{equation} k^{xx'} = \exp\Bigl(-\frac{1}{\sigma^2}\bigl(||x||^2 + ||x'||^2 - 2{\cal{F}}^{-1}(\hat x^\ast \odot \hat x')\bigr) \Bigr). \end{equation}
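The formula can be checked directly: the single inverse FFT in (9) yields the Gaussian kernel value between $x$ and every cyclic shift of $x'$ at once. A 1-D NumPy sketch with hypothetical values (note that the lab's MATLAB code additionally normalizes the squared distance by the number of elements, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 8, 0.5
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

# all-shifts Gaussian kernel values via the FFT, eq. (9)
c = np.real(np.fft.ifft(np.conj(np.fft.fft(x1)) * np.fft.fft(x2)))
k_fft = np.exp(-(x1 @ x1 + x2 @ x2 - 2 * c) / sigma**2)

# naive evaluation: one kernel value per cyclic shift of x2
k_direct = np.array([np.exp(-np.sum((x1 - np.roll(x2, -i))**2) / sigma**2)
                     for i in range(n)])

print(np.allclose(k_fft, k_direct))
```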

Note that for the linear kernel $k(x, x') = x^T x'$, this leads to \begin{equation} k^{xx'} = {\cal{F}}^{-1}(\hat x^* \odot \hat x'). \end{equation} Substituting (10) into (7) and (8) then recovers the linear correlation tracker equations (3) and (4), respectively.
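As a further sanity check: for training samples that are cyclic shifts of one patch and a shift-invariant kernel, the kernel matrix $K$ is circulant, so the direct solve in (6) must agree with the Fourier-domain formula (7). A 1-D NumPy sketch with the Gaussian kernel (hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, sigma = 8, 1e-4, 0.5
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2) / sigma**2)

# kernel matrix over all pairs of cyclic shifts -> circulant matrix
K = np.array([[rbf(np.roll(x, i), np.roll(x, j)) for j in range(n)]
              for i in range(n)])

# direct dual solution, eq. (6)
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier-domain solution, eq. (7): k^{xx} is the first column of K
kxx = K[:, 0]
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(kxx) + lam)))

print(np.allclose(alpha_direct, alpha_fft))
```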

Besides its efficiency, the tracker is also remarkably simple to implement. The code of the tracker core follows:

```matlab
% Input:
%   x - training image patch, H x W x C (height, width, channels)
%   y - regression target (Gaussian shape), H x W
%   z - test image patch, H x W x C
%
% Output:
%   responses - detection score for each location, H x W
%
% Parameters:
%   lambda - regularization weight (default: 1e-4)
%   sigma  - Gaussian kernel bandwidth (default: 0.2)

function alphaf = kcf_train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function responses = kcf_detect(alphaf, x, z, sigma)
  k = kernel_correlation(x, z, sigma);
  responses = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));  % Gaussian kernel
  %k = c;                                     % linear kernel
end
```

For the first frame, the model is trained at the initial position of the target object. The patch is larger than the tracked object to provide some context, and the input patch is weighted by a cosine window to mitigate the boundary artifacts of the circular shifts.
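The cosine window is commonly built as the outer product of two 1-D Hann windows; it decays smoothly to zero at the patch border, suppressing the wrap-around discontinuities implied by the cyclic-shift model. A NumPy sketch (patch size hypothetical):

```python
import numpy as np

H, W = 6, 5  # patch size (hypothetical)

# 2-D cosine (Hann) window: outer product of two 1-D Hann windows
window = np.outer(np.hanning(H), np.hanning(W))

# the window multiplies the input patch element-wise before training/testing
print(window[0, :], window[H // 2, W // 2])  # zero border row, large near center
```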

For a new frame, a test image patch is extracted at the current bounding-box position. The target is detected as the maximum-score location over the test patch and the bounding box is updated. A new model is trained at the new position. Finally, to provide the tracker with some memory, both `alphaf` and `x` are updated by a convex combination of the current and the previous states:

\begin{equation} \hat \alpha^{(t)} = \gamma \hat \alpha + (1-\gamma) \hat \alpha^{(t-1)}, \\ x^{(t)} = \gamma x + (1-\gamma) x^{(t-1)}, \end{equation} where parameter $\gamma \in [0,1]$ controls the tracker adaptation. Default $\gamma$ is 0.075.
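The update of eq. (11) is a per-element running (exponentially weighted) average; a tiny NumPy sketch with made-up values:

```python
import numpy as np

gamma = 0.075                            # adaptation rate, eq. (11)

alphaf_prev = np.full(4, 2.0 + 0.0j)     # previous model (made-up values)
alphaf_cur  = np.full(4, 4.0 + 0.0j)     # model trained on the current frame
x_prev, x_cur = np.zeros(4), np.ones(4)

# convex combination: small gamma = long memory, slow adaptation
alphaf = gamma * alphaf_cur + (1 - gamma) * alphaf_prev
x      = gamma * x_cur      + (1 - gamma) * x_prev

print(alphaf[0].real, x[0])
```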

Your task will be to track an object in the Billa-leaflet video billa.avi. You will implement the KCF tracker and generate a video with the tracked bounding box superimposed on every frame.

The video is processed by the following algorithm:

- Open the input video
- Set the target object bounding box → `bbox0`
- While not EOF
  - Read video frame → `I`
  - IF the 1st frame
    - Init the tracker state: `[S] = track_init_kcf(I, bbox0, opts)`
  - ELSE
    - Update the tracker state: `[S] = track_kcf(S, I)`
  - END
  - Visualize: `show_kcf_tracker_state(S, I)`
  - Add the result into the output video
  - Store the result in the output structure → `out`
- END

Tracker state structure `S` contains:

```matlab
S.bbox        %(4x1)   current bounding box [top_left_x, top_left_y, bot_right_x, bot_right_y]
S.lambda      %1e-4    regularization constant
S.sigma       %0.2     RBF Gaussian kernel bandwidth
S.gamma       %0.075   adaptation rate, see Eq. (11)
S.kernel_type %{rbf|linear}
S.window      %(WxHxC) cosine input window
S.x           %(WxHxC) training image patch
S.y           %(WxH)   target response, Gaussian shape with sigma = max(W,H)/10
S.z           %(WxHxC) test image patch
S.alphaf      %(WxH)   current tracker parameters
S.resp        %(WxH)   current response score image
S.resp_max    %(1x1)   maximum score
```

See `test_track_video.m` for details. Templates for both `track_init_kcf.m` and `track_kcf.m`, together with all necessary functions and data, are found here: ZIP.

You are supposed to upload the functions `track_init_kcf.m` and `track_kcf.m`, the output video you have generated, and all non-standard functions you have created.

To test your code, run `test_publish.m`, which calls `test.m`, the main test script, and generates an HTML page. Compare your results with ours; your output video should be similar to our video.

- [1] Henriques J. F., Caseiro R., Martins P., Batista J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. on PAMI 37(3):583–596, 2015.
