Image retrieval

A widely used image retrieval approach is based on the so-called Bag-of-Words (BoW) model. Each image is represented by a set of visual words from a visual codebook. In the previous labs, images were represented by a set of local descriptors extracted from image patches. We now create visual words in such a space of local descriptors. The visual codebook is obtained by quantization of the local descriptor space. In particular, this is performed via K-means on a large number of local descriptors extracted from many images. Each centroid is a visual word, which can be seen as the representative of a frequently occurring image patch. Given the visual codebook, each local descriptor of an image is represented by the id of the nearest centroid to it. The BoW approach represents an image by the set of visual words that appear in it. There is an analogy with the use of words in natural language and representing text documents by the set of words appearing in them.

For the purposes of this lab, we have created the visual codebook already and obtained the visual words for images. We provide you with the image files, the set of visual words in each image, and the corresponding geometry/shape of the respective local feature the visual word is obtained from.

In this lab we will implement a system to perform image retrieval.

  1. Assignment 1: We will use the BoW approach to represent images and use an inverted-file to perform the search efficiently. In this way, we will obtain an initial ranking of the database images according to their visual similarity to the query.
  2. Assignment 2: As a second step, we will perform fast spatial verification on a short-list of top-ranked images. This will allow use to estimate the geometric transformation between each top-ranked image and the query image, but also to find inlier correspondences. The number of inliers will be used to re-rank the short-list.

1. Retrieval with Bow TF-IDF and an inverted-file

BoW image vector - TF-IDF weighting

In real images, visual words tend to appear with different frequencies. Similar to words in the text domain, some words are more frequent than others. To handle this, we use the TF-IDF weighting scheme TF-IDF (term frequency–inverse document frequency). The BoW image vector is a k-dimensional vector per image, where $k$ is the number of visual words. The value in the $i$-th dimension is equivalent to $\mathrm{tf}_{i} \cdot \mathrm{idf}_{i}$, where $\mathrm{tf}_{i}$ is the number of occurrences of visual word $i$ in the image, and $\mathrm{idf}_{i}$ is the importance of visual word $i$ according to the inverted-document-frequency give by

$$ \mathrm{idf_{i}} = \log \frac{N}{n_i},$$

where $N$ is the total number of images and $n_i$ is the number of images containing visual word $i$. Visual words that are not present in any image get assigned to 0 weight, $\mathrm{idf_{i}}=0$.

  • Implement a function get_idf(image_visual_words, number_of_visual_words), which computes IDF weighting for every word and returns them in an array format with corresponding indices to those of the codebook.

BoW image similarity - inverted file

Each image is represented by the BoW TF-IDF vector. Similarity between two images is computed by the cosine similarity of the BoW vectors as:

$$\displaystyle\mathrm{score}(\mathbf{x},\mathbf{y}) = \cos{\varphi} = \frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|} = \frac{1}{\|\mathbf{x}\|\|\mathbf{y}\|}\sum_{i=1}^k x_i y_i = \sum_{i=1}^k \hat{x}_i \hat{y}_i, $$

where we take advantage of the fact that a part of the similarity can be computed ahead and L2 normalize the vectors elements (the L2 norm of the vector becomes equal to 1). After that, the similarity can be computed as a simple dot product of two vectors.

The visual vocabulary can be very big, often with a few thousands or even millions of visual words. To simplify the similarity estimation between such long and sparse vectors, the so called inverted file is used. Inverted file is a structure, which for each visual word (A, B, C, D) contains a list of images $\alpha, \beta,\gamma$ in which the visual word appears together with the corresponding TF-IDF weight. The example below represents that but for simplicity it does not include the IDF weights nor the L2 normalization.

We will represent the database as a sparse matrix. A sparse matrix is a structure for the memory-efficient representation of huge matrices where most of the entries are zeros. Therefore, it is an implementation of an inverted-file. We will use the scipy implementation ( see scipy documentation for details). Our inverted file will be a 2D sparse matrix, where columns are images, and rows are visual words, while columns should be L2 normalized.

  • Implement a function create_db(image_visual_words, number_of_visual_words,idf), which builds the inverted file in the form of a sparse matrix of size number_of_visual_words x number_of_images. The first input argument contains the list of visual words per image, and the last is an array of idf weights per visual word. You can have a look at csr_matrix in scipy csr_matrix.

Image retrieval with a query image

Given a query image and its BoW TF-IDF vector, we can use the inverted file to estimate the similarity to all other images. The similarity is then sorted to perform the retrieval. We will use one of the images stored in the inverted-file as a query. As a sanity check, the query image should be ranked first with similariry 1.0 due to the L2 normalization.

  • Implement a function retrieve(inverted_file, query_visual_words, idf), which computes the similarity of the query, represented by its set of visual words, to all images in the inverted file. Argument idf contains the idf weights per visual word so that the BoW TF-IDF vector for the query can be created. The result of this function is an array ranking with indices of images according to descending similarity to the query and sim array with the respective similarities.

What you should upload?

We are providing the overall pipeline in file and (found here) with some missing parts to be written by you (marked by “your code” in comments). Upload this script after adding your modifications. It performs retrieval from 1499 images using a visual codebook of 50,000 visual words. The data and images to run this script is found here and there.

2. Fast Spatial Verification.

An initial ranking of the database images is obtained according to the BoW similarity (first part of the lab). In this part, we will apply spatial verification on a short-list of shortlist_size top ranked images and re-rank them according to the number of inliers. These images are also called candidate images in the following.

Tentative correspondences

Spatial verification takes tentative correspondences between local features of the two images as input. If two local features are in the same visual word form then they form a correspondence. A tentative correspondence is defined as a pair of indices; the indices are indicating which local features of the query and the database image is the correspondence formed between.

  • Implement a function get_tentative_correspondencies(query_visual_words, image_visual_words), which computes tentative correspondences for all candidate images. The input is an array of visual words for the query and a list of arrays of visual words in the candidate images.

Transformation hypothesis and inlier counting

Our local features are represented by affine frames. For this reason, we choose affine transformation as the geometric model for spatial verification which allows us to generate a transformation hypothesis from a single tentative correspondence. The hypothesis from a single correspondence is the affine transform that transforms one affine frame to the other. There is no need to do random sampling anymore, we can go through all correspondences, verify each of them and save the best. Therefore, the algorithm becomes deterministic in contrast to RANSAC. We can verify all hypotheses (one for each tentative correspondence) and save the best one. The number of inliers will be the new score, which we will use for the re-ranking of the short-list.

  • Implement a function ransac_affine(query_geometry, candidatelist_geometry, correspondences, inlier_threshold), which estimates the number of inliers and the best hypothesis A of transformation from query to each of the database image in the short-list. Input arguments provide the geometry of the local features for the query and the candidate images, the correspondences between those and the inlier threshold.

What you should upload?

We are providing the overall pipeline in file and (found here) with some missing parts to be written by you (marked by “your code” in comments). Upload this script after adding your modifications. It performs retrieval from 1499 images using a visual codebook of 50,000 visual words. The data and images to run this script is found here and there.

3. Possible extras for further understanding (not for bonus points)

  1. Create a GUI to crop query regions and use more images as queries
  2. Show and investigate more images from the top-retrieved results
  3. Investigate the impact of the RANSAC threshold. What happens for small and large threshold? How can you fix the problem?
  4. Plot the tentative correspondences and the inliers between the query and each of the top-retrieved images
courses/mpv/labs/3_indexing/start.txt · Last modified: 2024/02/12 15:57 (external edit)