Search
In the previous labs, we were searching for correspondences between image pairs. A disadvantage of this method is its quadratic time complexity in the number of images. In this lab, we'll look at image (or object) retrieval – a common problem in many computer vision applications. The goal is to find images corresponding to our query image in an extensive image database. A naive method – trying to match all pairs – is too slow for large databases. Moreover, the set of similar images can be only a tiny fraction of the whole database. Therefore, faster and more efficient methods have been developed…
One of the fastest methods for image retrieval is based on the so-called bag-of-words model. The basic idea is to represent each image in a database by a set of visual words from a visual vocabulary. The visual word can be imagined as a representative of an often occurring image patch. With a visual vocabulary, images can be represented by visual words similarly to documents in a natural language.
In the previous labs, the images were represented by a set of descriptions of normalized patches. Visual words can be chosen from a space of such descriptions. Each description can be understood as a point in this space. We'll choose the visual words by vector quantization - one visual word for each cluster of descriptors. We will implement one method commonly used for vector quantization (clustering): K-means.
One of the simplest method for vector quantization is k-means. You should already know this algorithm from the Pattern Recognition course:
Repeat steps 2 and 3 until the assignments in step 2 do not change, or until the global change of point-to-mean distances decreases below some threshold.
Implement k-means.
nearest(means, data)
kmeans(K, data)
We prepared the data for you (in the test archive at the bottom of this assignment) to make this assignment a little bit easier. (So the next paragraph is just to put things into a context. You don't have to do these steps in first assignment.)
If we would like to use the functions kmeans and nearest to represent image by visual words, we would do so as follows:
kmeans
nearest
detect_and_describe
After the previous step, every image is represented by a vector of visual words. By summing the occurrences of the same words we get a vector representation of the image with counts of visual words (see rows <latex>\alpha</latex>,<latex>\beta</latex>,<latex>\gamma</latex> with occurences of words A,B,C,D on the picture below, left). To be able to search effectively in the image database, we need to estimate the similarity. The standard way is to sum the distances of corresponding descriptions. In the bag of words method the vector quantization is used to approximate the description distance. The distance between descriptions is 0 if they are assigned to the same visual word and infinity otherwise. For images represented as vectors of visual words we define similarity as:
<latex>\displaystyle\mathrm{score}(\mathbf{x},\mathbf{y}) = \cos{\varphi} = \frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|} = \frac{1}{\|\mathbf{x}\|\|\mathbf{y}\|}\sum_{i=1}^D x_i y_i</latex>
where x and y are vectors of visual words (bags of words). We can take advantage of the fact that a part of the similarity can be computed ahead and normalize the size of vectors. After that, the similarity can be computed as a simple dot product of two vectors.
The visual vocabulary can be very big, often with a few thousands or even millions of words. To simplify the distance estimation between such long vectors, the so called inverted file is used. Inverted file is a structure, which for each visual word (A, B, C, D in the picture) contains a list of images <latex>\alpha</latex>,<latex>\beta</latex>,<latex>\gamma</latex> in which the word appears together with the multiplicity.
We will represent the database as a sparse matrix. A sparse matrix is a structure for the memory-efficient representation of huge matrices where most of the entries are zeros. We will use scipy implementation ( see scipy documentation for details). Our inverted file will be a 2D sparse matrix, where columns are images, and rows are visual words. For instance, the third row of the matrix will contain weights of the third word in all images. We compute the weight of the word from its frequency divided by the length of the vector of visual words (i.e., square root of the sum of squared frequencies).
create_db(vw, num_words)
num_words
In real images, visual words tend to appear with different frequencies. Similar to words in the text, some words are more frequent than others. The number of visual words in one image is changing depending on the scene's complexity and the detector. To deal with these differences, we have to introduce a suitable weighting of visual words. One weighting scheme, widely used also in text analysis and document retrieval, is called TF-IDF (term frequency–inverse document frequency). Weight of each word $i$ in the image $j$ consists of two values:
$$ \mathrm{tf_{i,j}} = \frac{n_{i,j}}{\sum_k n_{k,j}},$$
where $n_{i,j}$ is the number of occurences of word $i$ in image $j$; and
$$ \mathrm{idf_{i}} = \log \frac{|D|}{|\{d: t_{i} \in d\}|},$$
where $|D|$ is the number of documents and $|\{d: t_{i} \in d\}|$ is the number of documents containing visual word $i$. For words that are not present in any image, <latex>\mathrm{idf_{i}}=0</latex>. The resulting weight is:
$$ \mathrm{(tf\mbox{-}idf)_{i,j}} = \mathrm{tf_{i,j}} \cdotp \mathrm{idf_{i}}$$
We need two functions
get_idf(vw, num_words)
create_db
create_db_tfidf(vw, num_words, idf)
Thanks to the inverted file, we can rank images according to their similarity to the query. The query is defined by a bounding box in the image around the object of interest. Take the visual words which are contained inside the rectangle. Compute the length and weights of visual words of the query and multiply it with the corresponding columns of the inverted file, matrix DB. The result is a similarity matrix between the query and database images. (You don't have to create visual words selection on your own for now. It is mentioned to show context).
query(DB, q, idf)
Upload file image_search.py with the before mentioned implemented methods.
To test your code, you can use a jupyter notebook test.ipynb in assignment_4_5_indexing_template. Data can be found here tfidf_test.zip. The notebook is also provided with expected results in commentaries. Feel free to use function get_A_matrix_from_geom in spatial_verification.py which returns geometry arranged into affine matrix.
After an image ranking according to vector similarities, we find the set of images, which have common words with the query. In an ideal case, we will find several views of the same scene. We know that the visual words are representing parts of the scene in the image. Therefore, the coordinates of the visual words in images of the same scene are spatially related. This relation is defined by the mutual position of the scene and the camera at the moment when the images were taken. We can use this relation to verify if the images contain the same scene or object.
The tentative correspondences of visual words will be needed for spatial verification. They will be used to discover a geometric transformation between the coordinates of visual words from the query and words from each relevant image.
After ordering according to tf-idf, we have a set of K relevant images. These are images that have at least several visual words common with the query. For each pair, query and a similar image, we find tentative correspondences. The same visual words represent features with similar descriptions. Still, there can be several features with the same visual words in one image and we have to try each pair because we do not have the original description anymore. If there are M features with the same particular visual word in the query image and N features with the same word in a similar image, the tentative corresponding pairs (pairs of word indices in the image) will be a Cartesian product of the feature sets. We treat each visual word in the image this way. There may be a lot of features with the same visual word in both images, which produces a large number of tentative pairs (NxM), from which only min(N,M) can be true correspondences. For this reason we will set two limits on the total number of tentative correspondence in each image pair (max_tc) and the maximal number of pairs (max_MxN). Moreover, to control the number of tentative correspondences, we will prefer the visual words with smaller product MxN. In summary, when computing the tentative correspondences, the following scenario is applied: 1) sort common visual words w.r.t. MxN, 2) keep adding tentative correspondences in ascending order of MxN until a termination condition is met (either MxN goes beyond max_MxN or the number of correspondences goes beyond max_tc).
K
M
N
min(N,M)
max_tc
max_MxN
MxN
get_tentative_correspondencies(query_vw, vw, relevant_idxs, max_tc, max_MxN)
query_vw
vw
We need a geometric model for the verification tentative correspondences. In the same manner, as in the previous task, we can use a homography or an epipolar geometry, but the verification of a huge amount of images can be too time-consuming. It happens quite often that only a small fraction of tentative correspondences are true correspondences. Therefore, it is more feasible to take a simpler model. We can use the fact that our feature points are not only points but affine (resp. similarity) frames. Remember that a frame defines a geometric relation between the canonical frame and an image. If we compose an affine (resp. similarity) transformation from query image into canonical frame and from canonical frame into relevant image from the database, we will get an affine (resp. similarity) transformation of visual word neighborhood from the query image to the image from the database.
It is clear that geometric transformation between the query and the image from the database is not necessarily affine (resp. similarity). Still from the assumption that we deal with a planar object, we can take it as a local approximation of perspective transformation. With a big enough threshold, we can consider all “roughly” correct points to be consistent. The advantage of this method is its speed. From each tentative correspondence, we get a hypothesis of transformation – it is not necessary to use sampling. This way, we can verify all hypotheses for each pair query–database image and save the best one. The number of inliers will be the new score, which we will use for the reordering of relevant images.
ransac_affine(q_geom, geometries, correspondencies, relevant, inlier_threshold)
query_spatial_verification(query_visual_words, query_geometry, bbox, visual_words, geometries, DB, idf, params)
ransac_affine
query
get_tentative_correspondencies
(optional task for 2 bonus points)
The basic idea of query expansion is to use knowledge of spatial transformation between query and database image, for enriching the set of visual words of the query. In the first step, correct images (results) for the first query have to be chosen. To avoid the expansion of unwanted images, it is necessary to choose only images that are really related to query image. This can be done by choosing only images with high enough score (for instance, at least 10 inliers). For each chosen image, the centers of visual words are projected by transformation A-1 and words which are projected inside the query bounding box are added into the query. It is a good idea to constrain the total number of visual words in the expanded query. It is desirable to choose visual words, which add information (it is useless to add a visual word from each image with frequency 100, it will only cause a problem during correspondence finding). Therefore, the words are ordered according to tf, and then the first max_qe_keypoints words are chosen.
tf
query_spatial_verification
To test your code, you can use a jupyter notebook (where you can also see a reference solution). In template you are also provided with utils.py where you can find tools for results visualization. Copy the file with mpvdb_haff2.mat database, unpack archive with images and put it into data directory in template.
Upload your image_search.py and spatial_verification.py in a single archive into 06_spatial task. Do not upload utils.py.
For this task, there is no auto evaluation; results will be checked by hand!
On the set of 1499 images (225 images by your colleagues from a previous year, 73 images of Pokemons, 506 images from dataset ukbench) and 695 images from the dataset Oxford Buildings) we have computed features with detector sshessian with Baumberg iteration (haff2, returns affine covariant points). We have estimated dominant orientation and computed SIFT descriptors. With k-means algorithm (with approximate NN) we have estimated 50000 cluster centers from SIFT descriptions and we have assigned all descriptions. You can find the data in file mpvdb50k_haff2.mat, where array of cells with the lists of visual words is stored in variable VW, array of cells with geometries in variable GEOM and names of the images in the variable NAMES. Cluster centers are stored in file mpvcx50k_haff2.mat in variable CX. Compressed images are in file mpvimgs1499.zip. Descriptions to images are in file mpvdesc_haff2.mat, stored as array of cells DESC. Elements of these cells are matrices of type UINT8 of size 128xNi, where Ni is number of points in ith image. Compute results on the database of haff2 detector (it should work better for more difficult perspective transformation).
Choose 7 images:
On each image chose rectangle with object of query. Choose visual words, which have centers inside the bounding box (including boundary). With function query_spatial_verification query the database and find results. Prepare a jupyter notebook where you show your results using utils.query_and_visu(q_id, visual_words, geometries, bbox_xyxy, DB, idf, options.copy(), img_names, figsize=(7, 7)) for visualization (as done in jupyter notebook in previous task).
utils.query_and_visu(q_id, visual_words, geometries, bbox_xyxy, DB, idf, options.copy(), img_names, figsize=(7, 7))
Then you transform your notebook into file results.html using jupyter nbconvert –to html my_results_on_big_db.ipynb –ExecutePreprocessor.timeout=60 –EmbedImagesPreprocessor.resize=small –output results.html
results.html
jupyter nbconvert –to html my_results_on_big_db.ipynb –ExecutePreprocessor.timeout=60 –EmbedImagesPreprocessor.resize=small –output results.html
Then put results.html into archive and upload file into 07_big_db.
figsize=(7, 7)