Searching in large image databases -- Image retrieval

In the previous labs we were searching for correspondences between image pairs. A disadvantage of this method is its quadratic time complexity in the number of images. In this lab we'll look at image (or object) retrieval – a common problem in many computer vision applications. The goal is to find images corresponding to our query image in a large image database. A naive method – trying to match all pairs – is too slow for large databases. Moreover the set of similar images can be only a tiny fraction of the whole database. Therefore, faster and more efficient methods have been developed. r

1. Image Representation with a Set of Visual Words.

One of the fastest methods for image retrieval is based on the so called bag-of-words model. The basic idea is to represent each image in a database by a set of visual words from a visual vocabulary. The visual word can be imagined as a representative of an often occurring image patch. With a visual vocabulary, images can be represented by visual words similarly to documents in a natural language.

In the previous labs, the images were represented by a set of descriptions of normalized patches. Visual words can be chosen from a space of such descriptions. Each description can be understood as a point in this space. We'll choose the visual words by vector quantization - one visual word for each cluster of descriptors. We will implement one method commonly used for vector quantization (clustering): K-means.

Vector Quantization - K-means

One of the simplest method for vector quantization is k-means.

Initialization: Choose a defined number $k$ of clusters centers (the red crosses in the image). Random points can be chosen from the set of descriptors, but more sophisticated method exists.
Assign all points to the nearest center – points are assigned to <latex>k</latex> clusters.
Shift each center into the center of gravity (mean) of the assigned points (compute the center of gravity from points in the same cluster). In the case when the mean has no assigned points a new position has to be chosen for this mean.

Repeat steps 2 and 3 until the assignments in step 2 do not change, or until the global change of point-to-mean distances decreases below some threshold.

2. Oxford by way of Paris

Your task is to quantize the Paris6K dataset . The Paris Dataset consists of 6412 images collected from Flickr by searching for particular Paris landmarks. By quantizing Paris, the visual words you estimate from K-means will describe the architecture of the city. Remember that the SIFT descriptors were computed from distinctive image patches extracted from the Paris image dataset. So if you look at the image patches associated with SIFT clusters (SIFTs assigned to the same visual word) found by K-means, you might see recurring architectural motifs in the city. As with any learning task, we won't test the learned visual vocabulary (visual words) on the training set, Paris6k . Instead, we'll want to perform our image retrieval tasks in the following labs on the Oxford5k dataset, Oxford5k . This will test how your vocabulary generalizes for a completely independent image set. You might ask, how independent is it? In other words, how much does Paris look like Oxford?

3. Your task

Implement k-means.

Write a function [idxs, dists]=label(data,means), which finds the nearest vector from the set of means (matrix of SINGLEs with size DxK, where D is the dimension of the descriptor space and K is the number of means) for each column vector in matrix data (of SINGLEs with size DxN, where N is the number of points). The output of this function are labels idxs (matrix 1xN) assining the data to their nearest means and distances dists (matrix 1xN), which are the corresponding distances to the nearest mean.
Write a function [means, J]=kmeans(data, K), which finds K means in the vector space data and return the sum J of distances between data points and means to which they are assigned.

The data has been prepared for you on the virtual machine. You will use the single precision SIFTs extracted from Paris6K dataset. Provided is a function to postprocess the SIFTs so that you will achieve better retrieval performance. Depending on your memory constraints you will have to implement the nearest neighbor calculation cleverly, so that you don't exhaust RAM.

Table of Contents

Searching in large image databases -- Image retrieval

1. Image Representation with a Set of Visual Words.

Vector Quantization - K-means

2. Oxford by way of Paris

3. Your task