This page is located in a preparation section till 20.02.2023.

Correspondence Problem

Searching for correspondences is a fundamental task in computer vision. The goal is to find corresponding parts (patches) of a scene in two or more images. Finding correspondences on a pixel level is often ill-posed, and instead correspondences for selected larger image portions (patches) are sought. This comprises of first detecting, in each image, the local features (a.k.a. interest points, distinguished regions, covariant regions) i.e. regions detectable in other images of the same scene with high degree of geometric and photometric invariance. Correspondences are then sought between these local features.

For each feature, the exact position and a description of the feature neighborhood is saved. The description can be a vector of pixel intensities, histogram of intensities, histogram of gradients, etc. The choice of a descriptor influences the discrimination and invariance of the feature. Roughly speaking, local feature descriptor is a vector, which characterises the image function in a small, well localized neighborhood of a feature point.

When establishing correspondences, for each feature in one image we look for features in the second image that have a similar description. We call these correspondences “tentative”, as some of them are usually wrong – they do not associate the same points in the scene. This happens because of noise, repeated structures in the scene, occlusions or inaccuracies of models of geometric and photometric transformation.

As a next step we need a robust algorithm, which (based on the knowledge of the image-to-image transformation model) will choose the largest subset of correspondences which are all consistent with the model. Such correspondences are called „verified“ or „inliers“ (with the model). An algorithm widely used for this purpose is called RANSAC (RANdom Sampling Consensus) and will be introduced it in the last section.

This can be implemented with a few lines of code using OpenCV library (see this notebook). We omit here data load and reshape for clarity:

det = cv2.AKAZE_create()    
kps1, descs1 = det.detectAndCompute(img1,None)
kps2, descs2 = det.detectAndCompute(img2,None)
matcher = cv2.DescriptorMatcher_create(cv2.DescriptorMatcher_BRUTEFORCE_HAMMING)
knn_matches = matcher.knnMatch(descriptors1, descriptors2, 2)
ratio_thresh = 0.8
tentative_matches = []
for m,n in knn_matches:
   if m.distance < 0.8 * n.distance:
src_pts, dst_pts = split_pts(tentative_matches, kps1, kps2)
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC,1.0)

In this course we will implement all these components from scratch using PyTorch and numpy

1. Feature detection

Feature detection is the first step in the process. We will implement two often used feature detectors. Detector of Hessian maxima and Harris detector.

Hessian detector

The goal of the detector is to repeatedly find the feature points which are well localized and there is enough information in their neighborhood for creating a good description. Detector of the local extrema of Hessians – determinant of Hessian matrix (matrix of second order derivatives)

$$\mathbf{H} = \left[\begin{array}{cc} D_{xx}(x,y;\sigma) & D_{xy}(x,y;\sigma)\\ D_{xy}(x,y;\sigma) & D_{yy}(x,y;\sigma) \end{array}\right],$$

finds centers of the so called blobs, local extrema of the intensity function, which are well localized in a neighborhood given scale $\sigma$. Interest points are strong local maxima with value (response) above a threshold t, which is set according to the level of noise in the image.

  • Implement a function hessian_response(img,sigma), which computes Hessian for each pixel of the given image img according to formula: $\mathrm{det}(\mathbf{H}) = D_{xx} D_{yy} - D_{xy}^2$. For computation of the derivatives $D_{xx},D_{yy},D_{xy}$, use the derivative of Gaussian with variance $\sigma^2$.

For the localisation of Hessian extrema we will need a function to find out if a point is a local extremum. This is done by comparing the pixel value with pixels in its neighborhood.

  • Implement a function nms2d(response,thresh) for non-maxima suppression. From input image response (tensor of size BxChHxW, where H stands for image height and W for image width) the function computes a new matrix maximg of the same size, in which all local maxima with response value greater than thresh will have value response and non-maxima points or local maxima with response below thresh will have value 0. The function checks for local maxima in 8-neighbourhood of the point.
  • Connect these two functions into a detector hessian(img,sigma,thresh), which will find Hessian maxima in the image img with scale of gradient sigma. Return only keypoint_locations (b, ch, y, x) which were thresholded to 1 - you can use function torch.nonzero loc=torch.nonzero(maximg) for finding these points in the matrix. Return tensor keypoint_locations of a size [N x 4] where keypoint_locations[0] is a vector with coordinates of the first point, keypoint_locations[1] coordinates of the second, etc. Pay attention to indices, which have order B, Ch, X, Y.

Harris detector

Harris detector detects locations in which the local autocorrelation function is changing in all directions (therefore, the image gradient direction is changing there). For this reason, it is often called a corner detector. It computes the autocorrelation matrix, $\mathbf{C}(x,y;\sigma_d;\sigma_i)$:

$$ \mathbf{C}(x,y;\sigma_d,\sigma_i)=G(x,y;\sigma_i)*\left[\begin{array}{cc} D^2_{x}(x,y;\sigma_d) & D_x D_y(x,y;\sigma_d)\\ D_x D_y(x,y;\sigma_d) & D^2_y(x,y;\sigma_d) \end{array}\right]$$

where * is the convolution operator. This is weighted averaging (windowing function $G(x,y;\sigma_i)$) of outer products of gradients. If both eigenvalues of $\mathbf{C}(x,y;\sigma_d;\sigma_i)$ are large, and are of comparable value then the pattern inside the windowing function has stable position and we want to use it for matching. Harris and Stephens select such points based on corner response:

$$ R(x,y) = \mathrm{det}(\mathbf{C})-\alpha\mathrm{trace}^2(\mathbf{C}). $$

where the advantage is that eigenvalues $\lambda_1,\lambda_2$ of matrix $\mathbf{C}$ do not have to be explicitly computed because

$$ \begin{array}{rcl} \mathrm{det}(\mathbf{C})\!&=&\!\lambda_1\lambda_2\\ \mathrm{trace}(\mathbf{C})\!&=&\!\lambda_1 + \lambda_2. \end{array} $$ Note however that explicit computation of eigenvalues would not be a problem with todays computers (how would you do it?)

Obrázek ukazuje průběh funkce R(x,y) pro alpha=0.04, bílé usečky označují izokontury s hodnotami 0, 0.02, 0.04, 0.06, ..., 0.18.
Graph of $R(x,y)$ for $\alpha=0.04$. White curves are isocontours for values $0, 0.02, 0.04, 0.06, \dots, 0.18$

The image shows the function $R(x,y)$ for $\alpha=0.04$, where white lines show iso-contours with values 0, 0.02, 0.04, 0.06, …, 0.18. When we want to have values greater than a threshold, we want the smaller of the two eigenvalues to be greater than the threshold. We also want to have the value of the greater eigenvalue greater if the smaller eigenvalue is smaller. In the case, when the both eigenvalues are similar, they can be smaller.

  • Write a function harris_response(img,$\sigma_d$,$\sigma_i$), which computes the response of Harris function for each pixel of input image img. For computation, use the derivative of Gaussian $D_{x},D_{y}$ with standard deviation $\sigma_d$ and a Gaussian as window function with standard deviation $\sigma_i$.
  • In the same way as with the detector of Hessian maxima, write a function keypoint_locations=harris(img,$\sigma_d$,$\sigma_i$,thresh), which detects the maxima of Harris function which are greater than thresh, using standard deviation $\sigma^2_d$ for derivatives and $\sigma_i^2$ for the window function.

Scale-space - automatic scale estimation

The basic version of Harris or Hessian detector needs one important parameter: the scale on which it estimates gradients in the image and detects “blobs”. It can be shown that the scale can be estimated automatically for each image. For this purpose, it is necessary to build the “scale-space” – a three dimensional space, in which two dimensions are x,y-coordinates in the image and the third dimension is the scale. The image is filtered to obtain its new versions corresponding to increasing scale. Suppressing the details simulates the situation when we are looking at the scene from greater distance.

With increasing suppression of details the variance of intensities is decreasing. Therefore, selection of characteristic scale is necessary to have comparable gradients between two scales. The term “normalized derivative” (considering the “distance” between pixels) was introduced for this purpose to obtain a “scaleless” gradient.

$$ \dfrac{\partial f}{\partial \xi} = \dfrac{\partial f}{\partial (x/\sigma)} = \sigma \dfrac{\partial f}{\partial x}\mbox{,}\quad N_x(x,y;\sigma) = \sigma D_x(x,y;\sigma)\mbox{,}\quad N_{xx}(x,y;\sigma) = \sigma^2 D_{xx}(x,y;\sigma)\mbox{,} $$

The normalized Hessian matrix is thus: $$ \mathbf{H}_{norm}(x,y,\sigma_i) = \sigma^2 \left[\begin{array}{cc} D_{xx}(x,y;\sigma) & D_{xy}(x,y;\sigma)\\ D_{xy}(x,y;\sigma) & D_{yy}(x,y;\sigma) \end{array}\right]. $$

Lindeberg in his works showed that for such normalized derivatives it is possible to compute the response of differential Laplacian operators


and Hessians

$\mathrm{det}(\mathbf{H}_{norm}(x,y,\sigma_i)) = N_{xx}(x,y;\sigma_i) N_{yy}(x,y;\sigma_i) - N_{xy}^2(x,y;\sigma_i)$

and use it for automatic learning of characteristic scale of ideal blob. After applying these operators, the local maxima of the image will give us x,y-coordinades and a scale $\sigma$ of the blobs.

Image of the Gaussian with standard deviations=8,16,24 and 32. The response of differential operators in scale space in the middle of Gaussians.
Image of the Gaussian with standard deviations=8,16,24 and 32. The response det(Hnorm) and trace(Hnorm) in the middle of Gaussians. Note the shift of the extreme proportional to the size of blob.
  • Write a function ss,sigma=create_scalespace(img,levels,step), which returns a 3D scale-space (matrix BxChxLxHxW, where H is height, W is width and L is number of levels), where ss[:,:,0,:,:] will be input image img, for which we will assume $\sigma_1 = 1$ and ss[:,:,i,:,:] will be it's filtered version $S(x,y,\sigma_{i+1})$, where $\sigma_{i+1}=\mathrm{step}\cdot\sigma_i $ for $i \in \{1,\ldots,\mathrm{levels}\}$. Vector sigma will contain values $\sigma_i$.
  • Write a function maximg=nms3d(response, threshold) for non-maxima suppression in a 5D matrix response (i.e. take to consideration all 26 points in the neigborhood). The function returns a matrix of the same size as input matrix response, in which there will be response in places of local maxima which are greater than thresh and 0 elsewhere.

 Point neighbourhood in 3D.

  • Write a function [hes,sigma]=scalespace_hessian_response(img), which will use function create_scalespace and compute Hessian response for normalized derivation of Gaussian. Use small constant sigma (finite difference) for the hessian_response input, if run on the output of the create_scalespace. Do not forget to multiply response function by the correct power of the corresponding sigma. Choose the parameters step and levels of function create_scalespace wisely (start with e.g. step=1.1, levels=30 and experiment with them). For computation of the response use estimation of the second derivatives which comes from applying of differential filters to the result of function scalespace. Normalize the derivatives with a standard deviation for each order of derivation (as described earlier). The result will be a 5D matrix hes with elements corresponding to normalized response of Hessian in scale-space.
  • Join functions nms3d and scalespace_hessian_response to detector of Hessian maxima with automatic scale estimation keypoint_locations=scalespace_hessian(img, thresh). Return only points with response greater than thresh. The result will be matrix of size [N x 5] for position b, ch, scale, y and x of thresholded Hessian maxima in the space-space. Choose internal sigma, so that visual results make sense. Use the same approach for sigma_i and sigma_d for scalespace_harris Top left point of the image has coordinates (0,0,0,0,0). You can use the following code for localization of points:

# get the matrix of maxima
nmsed=nms3d(response, thresh);
# find positions
loc = torch.nonzero(nmsed)
# Don't forget to convert scale index to scale value with use of sigma

Test your functions on this image:

For visualization of detected points use the function visualize_detections, defined in the local_detector.ipynb: You should obtain an output similar to this:

Image with detected multiscale Hesian maxima

Local feature detection.

To fulfil this assignment, you need to submit these files (all packed in one .zip file) into the upload system:

  • local_detector.ipynb - a notebook for data initialisation, calling of the implemented functions and plotting of their results (for your convenience, will not be checked).
  • - file with the following methods implemented:
    • harris_response, hessian_response - functions for computing response functions
    • nms2d, nms3d, - functions for applying 2d and 3d non maxima supression
    • hessian, harris, - functions for getting coordinates of the single scale Hessian and Harris maxima
    • create_scalespace, - function for creating Gaussian scale space
    • scalespace_hessian_response, scalespace_harris_response, - functions for computing response functions on scalespace tensor
    • scalespace_hessian, scalespace_harris, - unctions for getting coordinates of the multi scale Hessian and Harris maxima
  • - file with implemented functions from the previous assignment

Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.


2. Computing local invariant description

The task of a detector is to reliably find feature points and their neighborhoods such that the detected points are covariant with a desired class of transformations (geometric or photometric). For instance, the basic version of Harris or Hessian detector detects points which are covariant with translation (if you shift the image, detected points shift the same amount). Hessian in the scale-space or DoG detector adds information about the scale of the region and therefore points are co-variant with similarity transformation up to an unknown rotation (detected points have no orientation assigned). Affine covariant detectors like MSER, Hessian-Affine or Harris-Affine are co-variant with affine transformations. The following text describes how to handle geometric information from the point neighbourhood and how to use it for normalization. Depending on the amount of information that a detector gives about a features, the descriptions can be invariant to translation, similarity, affine or perspective transformation.

Geometric normalization

Geometric normalization is a process of geometric transformation of feature neighborhood into a canonical coordinate system. Information about geometric transformation will be stored in the form of “frame” – a projection of the canonical coordinate system into the neighborhood of a feature point (or region) in the image. The frame will be represented by a 3×3 matrix A, the same as in the first lab. The transformation matrix A is used to obtain a “patch” – a small feature neighborhood in the canonical coordinate system. All other measurements on this square patch of original image are now invariant to desired geometric transformation.

Original image Translation Translation
and rotation
Similarity Affine
cutoff1.jpg translation1.jpg euclidean1.jpg similarity1.jpg affine1.jpg
cutoff2.jpg translation2.jpg euclidean2.jpg similarity2.jpg affine2.jpg
Example of geometric normalization, with different classes of transformation (author of images: Štepán Obdržálek).

Another example from Wikipedia: Effect of applying various 2D affine transformation matrices on a unit square. Note that the reflection matrices are special cases of the scaling matrix.

Effect of applying various 2D affine transformation matrices on a unit square. Note that the reflection matrices are special cases of the scaling matrix.

For the construction of a transformation A, we use the geometric information about position and shape of point neighborhood:

  1. For rotation- and translation-covariant transformation, we have coordinates x,y and angle $\alpha$. The scale s (size of the frame) is fixed (manually) for all points.
    $$ A=\underbrace{\left[\begin{array}{ccc} s & 0 & x\\ 0 & s & y\\ 0 & 0 & 1\\ \end{array}\right]}_{A_{norm}} \left[\begin{array}{ccc} cos(\alpha) & sin(\alpha) & 0\\ -sin(\alpha)& cos(\alpha) & 0\\ 0 & 0 & 1\\ \end{array}\right] $$
  2. For similarity-covariant point, we have coordinates x,y scale $\sigma$ and angle $\alpha$.
    $$ A=\underbrace{ \left[\begin{array}{ccc} \sigma & 0 & x\\ 0 & \sigma & y\\ 0 & 0 & 1\\ \end{array}\right]}_{A_{norm}} \left[\begin{array}{ccc} cos(\alpha) & sin(\alpha) & 0\\ -sin(\alpha)& cos(\alpha) & 0\\ 0 & 0 & 1\\ \end{array}\right] $$
  3. For affine-covariant point, we can use three points to which the points from the canonical coordinate system are projected (as in the first lab). We can also use a known position, partial affine transformation (a 2×2 submatrix) and the residual rotation (angle $\alpha$) to construct the matrix A/
    $$ A=\underbrace{ \left[\begin{array}{ccc} a_{11} & a_{12} & x\\ a_{21} & a_{22} & y\\ 0 & 0 & 1\\ \end{array}\right]}_{A_{norm}} \left[\begin{array}{ccc} cos(\alpha) & sin(\alpha) & 0\\ -sin(\alpha)& cos(\alpha) & 0\\ 0 & 0 & 1\\ \end{array}\right] $$

Many detectors (all from the last lab) detect points (frames), which are similarity- or affine- covariant up to an unknown rotation (angle $\alpha$). For instance, the position and scale give us a similarity co-variant point up to unknown rotation. Similarly, the matrix of second moments and the centre of gravity give us five constraints for an affine transformation and only the rotation remains unknown. To obtain a fully similarity- or affine-covariant point, we need to define orientation $\alpha$ from a partially geometrically normalized point neighborhood.

The normalization including orientation estimation has to be done in two steps. In the first step, the patch invariant to translation, scale and partial affine transformation is created. On this normalized patch, the dominant gradient is estimated. Both transformation are then used together in the transformation matrix A.

To estimate the orientations of the dominant gradients on the partially normalized regions (using $A_{norm}$), the gradient magnitudes and orientations are estimated for each region and a histogram of these gradients is computed. While we assume a random rotation of the region, it is necessary to compute gradients only in circular neighborhood. This can be done by weighting with a window-function (e.g. Gaussian) or with a condition like $(x-x_{center})^2+(y-y_{center})^2 < (ps/2)^2$, where $ps$ patch edge length. Otherwise the content of the corners would influence the orientation estimate undesirably. The orientations of gradients are weighted by their magnitudes. This means that a “stronger” gradient brings more to the corresponding bin of the histogram. For improved robustness, we can use linear interpolation to vote into the closest neighboring bins. At the end, the histogram is filtered with a 1D Gaussian and the maximum is found. For a more precise localization it is possible to fit a parabola into the maximum neighbourhood and to find the orientation with a precision over 360/(number of bins) degrees.

  • Write a function angle=estimate_patch_dominant_orientation(patch) for dominant orientation estimation in a normalized patch. The input img is a partially normalized patch of the image (B x 1 PS x PS matrix of floats), the output is the angle measured from the x-axis of the image in clock-wise direction (angle for vector (x,y) can be computed with function torch.atan2(y,x)).

Now we are able to write the functions for geometric normalization of feature point neighborhoods with dominant orientations:

  • for a known position, write a function A,img_idxs=affine_from_location(keypoint_location), which output affine transformation A for patch extraction,img is the input image BxChXHxW, keypoint_location is [Bx5] is keypoint locations, as output from scalespace_hessian function.
  • for a known position and orientation , write a function A,img_idxs=affine_from_location_and_orientation(keypoint_location, orientation), where img is the input image BxChXHxW, keypoint_location is [Bx5] is keypoint locations, as output from scalespace_hessian function and orientation is [Bx1] angle in radians.

The output A and img_idx are input arguments for the function extract_affine_patches we implemented in the first lab.

Here's a pseudocode for normalization including orientation estimation:

A, img_idxs = affine_from_location(keypoint_locations)
patches =  extract_affine_patches(img, A, img_idxs, 19, 3.0)
ori = estimate_patch_dominant_orientation(patches)
A, img_idxs = affine_from_location_and_orientation(keypoint_locations, ori)
For the sake of clarity only the orientation of the strongest gradient is used for normalization in this task.

Affine covariance, Baumberg iteration

(optional study material for interested students)

The detection of similarity-covariant points, as maxima of Hessian in scale space, can be extended to affine-covariant points to an unknown rotation. It is based on the so called Baumberg iteration, the idea is to observe the distribution of gradients in the vicinity of the detected point. Assuming that there is a transformation, which “flattens” the intensity in the point's neighborhood (weighted average over the patch) in such a way that the distribution of gradients would be isotropic (with gradients distributed equally in all directions), the transformation can be found using the second moment matrix of gradients, i.e. the autocorrelation matrix of gradients that we already know.

$$ \mathbf{C}(x,y;\sigma_d,\sigma_i)=G(x,y;\sigma_i)*\left[\begin{array}{cc} D^2_{x}(x,y;\sigma_d) & D_x D_y(x,y;\sigma_d)\\ D_x D_y(x,y;\sigma_d) & D^2_y(x,y;\sigma_d) \end{array}\right]$$

The formula above is given for the general case, when the calculation is done directly in the image and $\sigma_i, \sigma_d$ corresponds to the feature scale. Because for the assignment you are working on the extracted patches, which already account for the feature scale, $\sigma_i$ means a simple weighted average over the patch.

As we have said earlier, this matrix reflects the local distribution of gradients, we can show that when we find a transformation $\mathbf{\mu} = \mathbf{C}^{-1/2}$, where $\mathbf{C}$ represents the whole patch, i.e. pooled over per-pixel C matrices by the window function. $\mathbf{\mu}$ translates the coordinate system given by eigenvectors and eigenvalues of ​​matrix $\mathbf{C}$ to a canonical one, the distribution of gradients in the transformed matrix $\mathbf{\mu}$ will be “more isotropic”. However, since we used a circular window function it is possible that for some gradients around, the weight will be calculated wrongly, or not at all. Therefore, it is necessary to repeat this procedure until the eigenvalues ​​of matrix $\mathbf{C}_i$ from the i-th iteration of the point's neighborhood will be close to identity (a multiplication of identity). We'll find out by comparing the ratio of eigenvalue of matrix $\mathbf{C}_i$. The resulting array of local affine deformation is obtained as the total deformation of the neighborhood needed to “become” isotropic:

$$N = \prod_i \mu_i$$

This transformation leads (after the addition of translation and scaling) from the image coordinates to the canonical coordinate system. In practice, we are mostly interested in the inverse transformation $\mathbf{A}=\mathbf{N}^{-1}$.

  • Write a function affshape = estimate_patch_affine_shape (patch), which calculates one step of the Baumberg iteration for the patch. The function returns Bx3 ellipse parameters affshape C. The order is [c11, c12, c22], as the C matrix is symmetrical.
  • for a known position, orientation and affine shape , write a function A,img_idxs=affine_from_location_and_orientation_and_affshape(keypoint_location, orientation, affshape), where img is the input image BxChXHxW, keypoint_location is [Bx5] is keypoint locations, as output from scalespace_hessian function, orientation is [Bx1] angle in radians, and affshape is [Bx3] matrix of ellipse parameters [a,b,c]. Hint: convert C to 2×2 submatrix A1 matrix by the following formula. Note, that square root and inversion are matrix expressions, not element-wise. $$A1=\sigma_i \, inv\left(C^{-1/2}/\sqrt{det(C^{-1/2})})\right)$$.

Photometric normalization

The goal of photometric normalization is to suppress the scene changes caused by illumination changes. The easiest way of normalization is to expand the intensity channel into whole intensity range. We can also expand the intensity channel such that(after transformation) the mean intensity is at half of the intensity range and the standard deviation of intensities \sigma corresponds to it's range (or better 2\times\sigma). In case we want to use all three color channels, we can normalize each channel separately.

  • Write a function patchnorm=photonorm(patch), which normalizes the patches such that the mean intensity value per channel will be 0 and the standard deviation will be 1.0. Values outside the range < -3,3> will be set to -3 or 3 respectively. Which deep learning normailzation layer can be used for this operation: InstanceNorm, GroupNorm, BatchNorm or LayerNorm? Why?

Computing the description

On the geometrically and photometrically invariant image patch from the last step, any low dimensional description can be computed. It is clear that the process of normalization is not perfect and therefore it is desirable to compute a description which is not very sensitive to the residual inaccuracies. The easiest description of the patch is the patch itself. Its disadvantages are high dimensionality and sensitivity to small translations and intensity changes. Therefore, we will try better descriptions too.

Write a function, calc_sift_descriptor(input, num_ang_bins, num_spatial_bins, clipval) which computes SIFT descriptor of the patch

SIFT descriptor.

Image source: VLFeat SIFT tutorial

Here is a description from the original SIFT paper

A keypoint descriptor is created by first computing the gradient magnitude and orientation (Hint: you can use spatial_gradient_first_order from the first lab) at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region.

This figure above shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the experiments in this paper use 4×4 descriptors computed from a 16×16 sample array. After this, descriptor is L2-normalized, maximum is clipped to clip_val and L2-normalized again.

Do not forget to weight the magnitudes by the distance to the subpatch center for the good performance, as shown below. Also, subpatches should be overlapping, see picture below from the

What should you upload?

Local descriptors.

To fulfil this assignment, you need to submit these files (all packed in one .zip file) into the upload system:

  • local_descriptor.ipynb - a notebook for data initialisation, calling of the implemented functions and plotting of their results (for your convenience, will not be checked).
  • - file with the following methods implemented:
    • affine_from_location - computes transformation matrix A which transforms point in homogeneous coordinates from canonical coordinate system into image from keypoint location (output of scalespace_harris or scalespace_hessian). Matrix A is the same, as in first assignment. Output of scalespace_hessian is of shape [N x 5], where N - is total number of detector features.

And 5 is b_ch_sc_y_x, where b is image index in batch, ch - channel index (not used), sc - scale, y - height, x - width index.

  • affine_from_location_and_orientation - computes transformation matrix A which transforms point in homogeneous coordinates from canonical coordinate system into image from keypoint location (output of scalespace_harris or scalespace_hessian) and orientation
  • estimate_patch_dominant_orientation, - Function, which estimates the dominant gradient orientation of the given patches, in radians.
  • photonorm function, which normalizes the patch intensity
  • calc_sift_descriptor - functions for SIFT descriptor calculation
  • Optional:
    • affine_from_location_and_orientation_and_affshape, - computes transformation matrix A which transforms point in homogeneous coordinates from canonical coordinate system into image from keypoint location (output of scalespace_harris or scalespace_hessian), orientation and affine_shape
    • estimate_patch_affine_shape - function, which estimates the patch affine shape by second moment matrix
  • - file with implemented functions from the previous assignment
  • - file with implemented functions from the previous assignment

Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.

Python and PyTorch Development


3. Tentative correspondences and RANSAC


Before we start, we'll need a function keypoint_locations, descs, A = detect_and_describe(img, det, th, affine, PS) connecting the previously implemented functions together, such that it detects, geometrically and photometrically normalizes and describes feature points in the input image img. You can find function detect_and_describe in the this notebook.

Matching: tentative correspondences

After detection and description, which run on each image of the scene separately, we need to find correspondences between the images. The first phase is looking for the tentative correspondences (sometimes called “matches”). A tentative correspondence is a pair of features with descriptions similar in some metric. From now on, we will refer to our images of the scene as the left one and the right one. The easiest way is to establish tentative correspondences is to find all distances between the descriptions of features detected in the left and the right image. We call this table a “distance matrix”. We will compute the distance in Euclidean space. Let's try several methods for correspondence selection:

  • All tentative correspondences without any filtering. Write a function matches_idxs, match_dists=match_nn(descs1, descs2), which calculated distance matrix and select nearest neighbor for each descriptor in descs1.

Nearest neighbor strategy. Features from img1 (blue circles) are matched to features from img2 (red squares). You can see, that it is asymmetric and allowing “many-to-one” matches.

  • Tentative correspondences of mutually nearest descriptions are created, when the nearest description in the left image for description A from right image is description B and at the same time, description A is the closest description in the right image to description B from the left image. Write a function matches_idxs, match_dists=match_mnn(descs1, descs2), which implements this strategy.

Mutual nearest neighbor strategy. Features from img1 (blue circles) are matched to features from img2 (red squares). Only cross-consistent matches (green) are retained.

  • A different procedure is to find the two closest descriptions in the right image for each description from the left image and then test, if the ratio of first/second closest (SNN) description is smaller than a threshold (use e.g. 0.8). If so, the descriptions correspond and it is guaranteed, that there are no confusing descriptions which would be too close in the description space. This method can be used for well discriminative description - for instance SIFT. Write a function matches_idxs, match_dists=match_snn(descs1, descs2, th), which implements this strategy.

Second nearest ratio strategy. Features from img1 (blue circles) are matched to features from img2 (red squares). For each point in img1 we calculate two nearest neighbors and check their distance ratio . If both are too similar (>0.8, bottom at Figure), then the match is discarded. Only confident matches are kept

  • Combination of the two previous approaches: mutual nearest neighbors, that also survive SNN check in both directions. Write a function matches_idxs, match_dists=match_smnn(descs1, descs2, th), which implements this strategy. The resulting distance ratio should be maximum over distance ratio in both directions. (optional bonus task)

The output of finding tentative correspondences are pairs of indices of features from the left and the right image with the closest descriptions.

First output will be a Nx2 matrix containing the indices of tentative correspondences, where the first row corresponds to indices in descs1 and second row to indices in descs2. Second output will the correspondence quality Nx1 matrix: descriptor distance for match_nn and match_mnn, and descriptor ditance ratio for match_snn and match_smnn.

Please, make sure to test your code for the corner-cases. Specifically, for situations, where the correct answer for matching would be “there is no match”.

Additional reading on matching strategies

RANSAC and the geometric model of the scene

The tentative correspondences from the previous step usually contain many non-corresponding points, so called outliers, i.e. tentative correspondences with similar descriptions that do not correspond to the same physical object in the scene. The last phase of finding correspondences aims to get rid of the outliers. We will assume images of the same scene and search a model of the undergoing transformation. The result will be a model of the transformation and a set of the so called inliers, tentative correspondences that are consistent with the model.

The model of a planar scene

The relation between the planes in the scene is discussed in the course Digital Image, (look at the document on searching of the projective transformation), or Geometry for Computer Vision. In short:

The transformation between two images of a plane observed from two views (left and right image) using the perspective camera model is a linear transformation (called homography) of 3-dimensional homogeneous vectors (of corresponding points in the homogenoeous coordinates) represented by a regular $3\times 3$ matrix $\mathbf{H}$:

$$ \lambda\left[\!\begin{array}{c}x'\\y'\\1\end{array}\!\right]= \left[\begin{array}{ccc}h_{11}&h_{12}&h_{13}\\h_{21}&h_{22}&h_{23}\\h_{31}&h_{32}&h_{33}\end{array}\right] \left[\!\begin{array}{c}x\\y\\1\end{array}\!\right]% $$

For each pair of corresponding points $(x_i,y_i)^{\top} \leftrightarrow (x'_i,y'_i)^{\top}$ it holds:

$$\begin{array}{lcr} x' (h_{31}\,x + h_{32}\,y + h_{33}) - (h_{11}\,x + h_{12}\,y + h_{13}) & = & 0 \\ y' (h_{31}\,x + h_{32}\,y + h_{33}) - (h_{21}\,x + h_{22}\,y + h_{23}) & = & 0 \end{array}$$

Let us build a system of linear constraints for each pair:

$$ \underbrace{\left[\begin{array}{r@{\ }r@{\ }rr@{\ }r@{\ }rr@{\ \ }r@{\ \ }r} -x_1&-y_1&-1&0&0&0&x_1'x_1&x'_1y_1&x_1'\\0&0&0&-x_1&-y_1&-1&y_1'x_1&y'_1y_1&y_1'\\ -x_2&-y_2&-1&0&0&0&x_2'x_2&x'_2y_2&x_2'\\0&0&0&-x_2&-y_2&-1&y_2'x_2&y'_2y_2&y_2'\\ &\vdots&&&\vdots&&&\vdots&\\ -x_n&-y_n&-1&0&0&0&x_n'x_n&x'_ny_n&x_n'\\0&0&0&-x_n&-y_n&-1&y_n'x_n&y'_ny_n&y_n'\\ \end{array}\right]}_\mathbf{C} \left[\begin{array}{c}h_{11}\\h_{12}\\h_{13}\\h_{21}\\h_{22}\\h_{23}\\h_{31}\\h_{32}\\h_{33}\end{array}\right]=0 $$

This homogeneous system of equations has 9 degrees of freedom and we get one non-trivial solution as a right nullspace of the matrix \mathbf{C}. For a unique solution we need at least 8 linearly independent rows of matrix \mathbf{C}, i.e. 8 constraints from 4 corresponding points in a general position (no triplet of points can lie on a line!).

  • implement a function H=getH(min_sample) that gets a matrix of 4 corresponding pairs of points and computes the homography \mathbf{H}. In case that some of the triplets will lie close to a line, it returns a None value. The input u is a 4×4 matrix as shown below, where xi,yi are point coordinates in the left image and x'i,y'i in the right image. :

$$ u = \left[\begin{array}{cccc} x_1 & y_1 & x_1' & y_1' \\ x_2 & y_2 & x_2' & y_2' \\ x_3 & y_3 & x_3' & y_3' \\ x_4 & y_4 & x_4' & y_4' \end{array}\right] $$ Do not forget to normalize by the bottomright element of the homography matrix, such that $H = [[h_{11}, h_{12}, h_{13}], [h_{21}, h_{22}, h_{23}], [h_{31}, h_{32}, 1]]$. If you use torch.svd, remember that by default it returns [8×9] matrices instead of [9×9], so pass some=False.

  • write a function dist=hdist(H,pts_matches), which gets a Nx4 matrix pts_matches of N pairs of points in Euclidean coordinates and returns a vector dist (1xN) of squared distances of the points $\mathbf{H} (x_i, y_i,1)^\top$ and $(x'_i, y'_i,1)^\top$ in Euclidean space.


To find a correct model of the scene – a projective transformation between two planes, we need at least 4 inlier point pairs (not in a line). One way of picking such points in a robust way from all the pairs of acquired tentative correspondences is the RANSAC algorithm of M.Fischler and R.C.Bolles.

In our task, the RANSAC algorithm will have this form:

  1. Uniformly sample 4 corresponding pairs. Write a function sample(pts_matches, num), which takes Nx4 matrix and randomly selects num points out of it.
  2. Compute the homography $\mathbf{H}$ out of the sampled point pairs.
  3. Compute the support of the found model. If the transformation $\mathbf{H}$ was computed from an all-inlier sample, all inlier points from the left image $(x_i,y_i)$ shall be projected by $\mathbf{H}$ to the corresponding points $(x'_i,y'_i)$ in the other image. The support of the model is quantified by the number of points consistent with the model, i.e. the number of correspondences with distance (function hdist) of transformed point from the left image and the corresponding point in the right image smaller than a predefined threshold (e.g. 1-3 pixels).
  4. If the support of the model is the biggest so far, we keep this so-far-the-best homography $\mathbf{H}^*$ and the corresponding set of inliers, i.e points that supported this model.
  5. We iterate points 1-4 and keep track of so-far-the-best $\mathbf{H}^*$ transformation.

Set the number of iterations to 1000 in the beginning and debug the RANSAC algorithm on a simple pair of images. A more sophisticated stopping criterion is used in practice. It is based on the probability estimate of finding an all-inlier sample (of size 4 in our case) under the assumptions of iteratively estimated fraction of inliers. - Implement the stopping criterium is implemented function new_max_iter = nsamples(n_inl, num_tc, sample_size, conf) where n_inl is a number of inliers, num_tc is a number of tentative correspondences and sample_size is 4. $$ \text{new_max_iter} = \frac{\log(1 - \text{conf})}{\log(1 - \text{inl_ratio}^{\text{sample_size}})} $$

  • use the provided functions to implement s function Hbest,inl=ransac_h(pts_matches,threshold,confidence,max_iter) that robustly estimates the homography $\mathbf{H}^*$ and a set of inliers inl. Input pts_matches is a Nx4 matrix of corresponding points in homogeneous coordinates (similar to function hdist). The parameter threshold is the maximum distance of reprojected points (computed via hdist) that will be marked as inliers (pairs consistent with the model $\mathbf{H}^*$). The parameter confidence is a value from the range (0,1), the requested confidence of getting a correct sample (a good value for testing is for example 0.95).
    The output is the best model $\mathbf{H}^*$ and an array inl (Nx1), where 0 represents an outlier and 1 represents an inlier.

What should you upload?

Local feature detection.

To fulfil this assignment, you need to submit these files (all packed in one .zip file) into the upload system:

  • matching_and_ransac.ipynb - a notebook for data initialisation, calling of the implemented functions and plotting of their results (for your convenience, will not be checked).
  • - file with the following methods implemented:
    • match_nn, match_mnn, match_snn, match_smnn - functions for matching descriptors
  • - file with the following methods implemented:
    • hdist - function for reprojection error calculation
    • sample - function for random sampling correspondences for H or F calculation
    • getH - function which computes H from 4 correspondences
    • nsamples - function, which updates maximum number of iteration needed for given confidence level
    • ransac_h - function, which robustly estimates homography from noisy correspondences
  • - file with implemented functions from the previous assignment
  • - file with implemented functions from the previous assignment
  • - file with implemented functions from the previous assignment

Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.

Task related to the testing implemented functions on your own dataset will be added soon, please check updates


To test your code use the provided notebook notebook.


In the tournament, you should implement a function matchImages, which takes two images (in form of torch.Tensor) and output homography transform, which maps coordinates in image 1 into image 2. All the parameters are up to participants. You are free in choosing the approach you would like to apply: integration and tuning the things you have done in the previous labs, or developing completely different algorithm, e.g. based on photometric error minimization. Important: you cannot use a solution from 3rd party library (by import or by copy-paste), but should develop it yourself.


Evaluation metric is mean average accuracy. Specifically, evaluation is done as following: Corner points (0,0), (0,h), (w, h), (w,0) are reprojected to the second image via estimated and ground truth homography. The average absolute error over 4 point is calclulated (MAE). See image below for the illustration. Image credit: Christiansen, 2019, UnsuperPoint

Mean average error on image boundaries. Image from UnsuperPoint paper

For accuracy threshold in th in [1.0, 2.0, 5.0, 10., 15., 20] percentage of image pairs, where MAE < th is calculated, resulting in average accuracy Ai. Finally, all per-threshold accuracies Ai are averaged into final score.

The number of image pairs in test dataset: around 40. Timeout: 750 sec. The number of points you get is max(7, 17 - your rank). So, 1st place gets 17 bonus points, 2nd place - 9 and so on. All participants get at least seven points if their solution produces non-zero accuracy.

If you submit after deadline, you cannot get the bonus points at all. The regular points (7) are still available - with standard penalty rules.

Local evaluation and debugging

For local evaluation you could use get_MAE_imgcorners function from the provided template. We recommend to use Oxford-Affine dataset for local validation purposes.

Important! Images are loaded with img = kornia.image_to_tensor(cv2.imread(fname,0),False).float()/255., so their range is [0,1.0], not [0,255]

What should you upload?

Wide baseline stereo tournament.

To fulfill this assignment, you need to submit these files (all packed in one .zip file) into the upload system:

  • - file with all functions you need.
  • weights.pth - Optional. If you have some trained CNN or another model, you could store parameters in this file

Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.

Remember that you are not allowed to use any 3rd party implementations of detectors, descriptors, matchers, RANSACs and so on. However, you may use utility functions, like spatial gradient, bluring, nms.

courses/mpv/labs/2_correspondence_problem/start.txt · Last modified: 2023/01/24 14:00 (external edit)