Searching for correspondences is a fundamental task in computer vision. The goal is to find corresponding parts (patches) of a scene in two or more images. Finding correspondences on a pixel level is often ill-posed, so instead correspondences are sought for selected larger image portions (patches). This comprises first detecting, in each image, local features (a.k.a. interest points, distinguished regions, covariant regions), i.e. regions detectable in other images of the same scene with a high degree of geometric and photometric invariance. Correspondences are then sought between these local features.
For each feature, its exact position and a description of the feature neighborhood are saved. The description can be a vector of pixel intensities, a histogram of intensities, a histogram of gradients, etc. The choice of descriptor influences the discriminability and invariance of the feature. Roughly speaking, a local feature descriptor is a vector which characterises the image function in a small, well-localized neighborhood of a feature point.
When establishing correspondences, for each feature in one image we look for features in the second image that have a similar description. We call these correspondences “tentative”, as some of them are usually wrong – they do not associate the same points in the scene. This happens because of noise, repeated structures in the scene, occlusions or inaccuracies of models of geometric and photometric transformation.
As a next step we need a robust algorithm which, based on the knowledge of the image-to-image transformation model, chooses the largest subset of correspondences consistent with the model. Such correspondences are called “verified” or “inliers” (with respect to the model). An algorithm widely used for this purpose is RANSAC (RANdom SAmple Consensus), introduced in the last section.
This can be implemented with a few lines of code using the OpenCV library (see this notebook). We omit data loading and reshaping here for clarity:
import cv2

det = cv2.AKAZE_create()
kps1, descs1 = det.detectAndCompute(img1, None)
kps2, descs2 = det.detectAndCompute(img2, None)

matcher = cv2.DescriptorMatcher_create(cv2.DescriptorMatcher_BRUTEFORCE_HAMMING)
knn_matches = matcher.knnMatch(descs1, descs2, 2)

# second-nearest-neighbor ratio test
ratio_thresh = 0.8
tentative_matches = []
for m, n in knn_matches:
    if m.distance < ratio_thresh * n.distance:
        tentative_matches.append(m)

src_pts, dst_pts = split_pts(tentative_matches, kps1, kps2)
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 1.0)
In this course, we will implement all these components from scratch using PyTorch and NumPy.
Feature detection is the first step in the process. We will implement two commonly used feature detectors: the detector of Hessian maxima and the Harris detector.
The Harris detector detects locations in which the local autocorrelation function changes in all directions (i.e. the image gradient direction changes there). For this reason, it is often called a corner detector. It computes the autocorrelation matrix $\mathbf{M}(x,y;\sigma_d,\sigma_i)$:
$$ \mathbf{M}(x,y;\sigma_d,\sigma_i)=G(x,y;\sigma_i)*\left[\begin{array}{cc} D^2_{x}(x,y;\sigma_d) & D_x D_y(x,y;\sigma_d)\\ D_x D_y(x,y;\sigma_d) & D^2_y(x,y;\sigma_d) \end{array}\right]$$
where $*$ is the convolution operator. This is a weighted averaging (with windowing function $G(x,y;\sigma_i)$) of the outer products of gradients. If both eigenvalues of $\mathbf{M}(x,y;\sigma_d,\sigma_i)$ are large and of comparable magnitude, then the pattern inside the window has a stable position and we want to use it for matching. Harris and Stephens select such points based on the corner response:
$$ R(x,y) = \mathrm{det}(\mathbf{M})-\alpha\mathrm{trace}^2(\mathbf{M}). $$
where the advantage is that eigenvalues $\lambda_1,\lambda_2$ of matrix $\mathbf{M}$ do not have to be explicitly computed because
$$ \begin{array}{rcl} \mathrm{det}(\mathbf{M})\!&=&\!\lambda_1\lambda_2\\ \mathrm{trace}(\mathbf{M})\!&=&\!\lambda_1 + \lambda_2. \end{array} $$ Note, however, that explicit computation of the eigenvalues would not be a problem with today's computers (how would you do it?).
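One way to answer that question: for a symmetric 2×2 matrix the eigenvalues follow in closed form from the same two invariants,
$$ \lambda_{1,2} = \frac{\mathrm{trace}(\mathbf{M}) \pm \sqrt{\mathrm{trace}^2(\mathbf{M}) - 4\,\mathrm{det}(\mathbf{M})}}{2}, $$
or they can be computed numerically, e.g. with torch.linalg.eigvalsh.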
The image shows the function $R(x,y)$ for $\alpha=0.04$, where white lines show iso-contours with values 0, 0.02, 0.04, 0.06, …, 0.18. Thresholding $R$ effectively requires the smaller of the two eigenvalues to be sufficiently large; if the smaller eigenvalue decreases, the larger one has to grow to keep the response above the threshold, while two eigenvalues of similar magnitude can both be somewhat smaller.
Implement the following functions:
- response = harris_response(img, sigma_d, sigma_i) — computes the Harris corner response with derivative scale sigma_d and integration scale sigma_i.
- keypoint_locations = harris(img, sigma_d, sigma_i, thresh) — returns the locations of response maxima above the threshold thresh.
- maximg = nms2d(response, threshold) — non-maxima suppression of the response map.
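For illustration, here is a minimal sketch of a possible harris_response, assuming B×1×H×W float tensors and using kornia's filtering utilities for brevity (in the assignment you would typically use your own filters from the first lab, e.g. spatial_gradient_first_order); the kernel sizes and conventions below are assumptions, not the template's:

```python
import torch
import kornia

def harris_response(img, sigma_d, sigma_i, alpha=0.04):
    # img: Bx1xHxW float tensor in [0, 1]
    # 1) derivative scale: smooth, then take first-order gradients
    kd = 2 * int(3.0 * sigma_d) + 1
    blurred = kornia.filters.gaussian_blur2d(img, (kd, kd), (sigma_d, sigma_d))
    grads = kornia.filters.spatial_gradient(blurred, mode='sobel', order=1)  # Bx1x2xHxW
    gx, gy = grads[:, :, 0], grads[:, :, 1]
    # 2) integration scale: Gaussian windowing of the outer products of gradients
    ki = 2 * int(3.0 * sigma_i) + 1
    window = lambda t: kornia.filters.gaussian_blur2d(t, (ki, ki), (sigma_i, sigma_i))
    m11, m22, m12 = window(gx * gx), window(gy * gy), window(gx * gy)
    # 3) corner response R = det(M) - alpha * trace(M)^2
    return m11 * m22 - m12 ** 2 - alpha * (m11 + m22) ** 2
```

For the scale-space version described below, remember to additionally multiply the gradient products by $\sigma_d^2$, as in the normalized autocorrelation matrix.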
The basic version of Harris or Hessian detector needs one important parameter: the scale on which it estimates gradients in the image and detects “blobs”. It can be shown that the scale can be estimated automatically for each image. For this purpose, it is necessary to build the “scale-space” – a three dimensional space, in which two dimensions are x,y-coordinates in the image and the third dimension is the scale. The image is filtered to obtain its new versions corresponding to increasing scale. Suppressing the details simulates the situation when we are looking at the scene from greater distance.
As details are increasingly suppressed, the variance of intensities decreases. Therefore, a characteristic scale has to be selected so that gradients are comparable between two scales. For this purpose, the term “normalized derivative” (which accounts for the “distance” between pixels) was introduced to obtain a “scale-less” gradient:
$$ \dfrac{\partial f}{\partial \xi} = \dfrac{\partial f}{\partial (x/\sigma)} = \sigma \dfrac{\partial f}{\partial x}\mbox{,}\quad N_x(x,y;\sigma) = \sigma D_x(x,y;\sigma)\mbox{,}\quad N_{xx}(x,y;\sigma) = \sigma^2 D_{xx}(x,y;\sigma)\mbox{,} $$
The normalized autocorrelation matrix is thus:
$$ \mathbf{C_{norm}}(x,y;\sigma_d,\sigma_i)=\sigma_d^2 G(x,y;\sigma_i)*\left[\begin{array}{cc} D^2_{x}(x,y;\sigma_d) & D_x D_y(x,y;\sigma_d)\\ D_x D_y(x,y;\sigma_d) & D^2_y(x,y;\sigma_d) \end{array}\right]$$
Implement the following functions:
- ss, sigma = create_scalespace(img, levels, step) — builds the scale space as a 5D tensor, where ss[:,:,0,:,:] is the first level, ss[:,:,i,:,:] is the i-th level, and sigma holds the corresponding scales.
- maximg = nms3d(response, threshold) — non-maxima suppression over the full (x, y, scale) scale space.
- hes, sigma = scalespace_harris_response(img) — computes the harris_response on every level of the scale space.
- keypoint_locations = scalespace_harris(img, thresh) — combines the scale-space response and nms3d and returns the detected points with their scales.
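A possible create_scalespace could look as follows; the base scale sigma0 and the 3σ kernel size are assumptions added for this sketch and are not part of the assignment signature:

```python
import torch
import kornia

def create_scalespace(img, levels, step, sigma0=1.0):
    # img: Bx1xHxW float tensor; returns ss (Bx1xLEVELSxHxW) and the per-level scales
    slices, sigmas = [], []
    sigma = sigma0
    for i in range(levels):
        k = 2 * int(3.0 * sigma) + 1                       # ~3-sigma Gaussian kernel
        slices.append(kornia.filters.gaussian_blur2d(img, (k, k), (sigma, sigma)))
        sigmas.append(sigma)
        sigma *= step                                      # geometric scale progression
    ss = torch.stack(slices, dim=2)                        # ss[:, :, i, :, :] is level i
    return ss, torch.tensor(sigmas)
```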
# get the matrix of maxima
nmsed = nms3d(response, thresh)
# find positions
loc = torch.nonzero(nmsed)
# Don't forget to convert the scale index to the scale value with use of sigma
...
For visualization of detected points use the function visualize_detections, defined in local_detector.ipynb.
Local feature detection.
To fulfil this assignment, you need to submit these files (all packed in one .zip file) into the upload system:
- local_detector.ipynb
- local_detector.py — containing the implemented functions, including nms2d and harris
- imageprocessing.py (from the previous lab)
Use the template of the assignment. When preparing the zip file for the upload system, do not include any directories; the files have to be in the zip file root.
The task of a detector is to reliably find feature points and their neighborhoods such that the detected points are covariant with a desired class of transformations (geometric or photometric). For instance, the basic version of the Harris or Hessian detector detects points which are covariant with translation (if you shift the image, the detected points shift by the same amount). The Hessian in scale space or the DoG detector adds information about the scale of the region, so the points are covariant with a similarity transformation up to an unknown rotation (detected points have no orientation assigned). Affine-covariant detectors like MSER, Hessian-Affine or Harris-Affine are covariant with affine transformations. The following text describes how to handle the geometric information from the point neighbourhood and how to use it for normalization. Depending on the amount of information that a detector gives about a feature, the descriptions can be invariant to translation, similarity, affine or perspective transformation.
Geometric normalization is the process of geometrically transforming a feature neighborhood into a canonical coordinate system. The information about the geometric transformation is stored in the form of a “frame” – a projection of the canonical coordinate system into the neighborhood of the feature point (or region) in the image. The frame will be represented by a 3×3 matrix A, the same as in the first lab. The transformation matrix A is used to obtain a “patch” – a small feature neighborhood in the canonical coordinate system. All further measurements on this square patch of the original image are then invariant to the desired geometric transformation.
Another example from Wikipedia: Effect of applying various 2D affine transformation matrices on a unit square. Note that the reflection matrices are special cases of the scaling matrix.
For the construction of the transformation A, we use the geometric information about the position and shape of the point neighborhood.
Many detectors (all from the last lab) detect points (frames) which are similarity- or affine-covariant up to an unknown rotation (angle $\alpha$). For instance, the position and scale give us a similarity-covariant point up to an unknown rotation. Similarly, the matrix of second moments and the centre of gravity give us five constraints for an affine transformation, and only the rotation remains unknown. To obtain a fully similarity- or affine-covariant point, we need to estimate the orientation $\alpha$ from a partially geometrically normalized point neighborhood.
The normalization including orientation estimation has to be done in two steps. In the first step, a patch invariant to translation, scale and the partial affine transformation is created. On this normalized patch, the dominant gradient orientation is estimated. Both transformations are then combined in the transformation matrix A.
To estimate the orientations of the dominant gradients on the partially normalized regions (using $A_{norm}$), the gradient magnitudes and orientations are computed for each region and a histogram of the orientations is built. Since we assume an arbitrary rotation of the region, it is necessary to compute gradients only in a circular neighborhood. This can be done by weighting with a window function (e.g. a Gaussian) or with a condition like $(x-x_{center})^2+(y-y_{center})^2 < (ps/2)^2$, where $ps$ is the patch edge length; otherwise the content of the corners would influence the orientation estimate undesirably. The orientations of the gradients are weighted by their magnitudes, so a “stronger” gradient contributes more to the corresponding bin of the histogram. For improved robustness, we can use linear interpolation to vote into the closest neighboring bins. At the end, the histogram is filtered with a 1D Gaussian and the maximum is found. For a more precise localization it is possible to fit a parabola into the neighbourhood of the maximum and find the orientation with a precision better than 360/(number of bins) degrees.
Write a function angle = estimate_patch_dominant_orientation(patch) which returns the dominant gradient orientation of a (geometrically normalized) patch. The function torch.atan2(y, x) is useful for computing the gradient orientations.
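A minimal sketch of the orientation estimation described above, assuming B×1×PS×PS patches and a plain (non-interpolated) histogram; bin interpolation and the Gaussian smoothing of the histogram are left out for brevity:

```python
import math
import torch
import kornia

def estimate_patch_dominant_orientation(patch, num_bins=36):
    # patch: Bx1xPSxPS, already geometrically normalized
    b, _, ps, _ = patch.shape
    grads = kornia.filters.spatial_gradient(patch, mode='sobel', order=1)
    gx, gy = grads[:, :, 0], grads[:, :, 1]
    mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    ori = torch.atan2(gy, gx)                               # in (-pi, pi]
    # circular window: ignore the patch corners
    center = (ps - 1) / 2.0
    ys = torch.arange(ps).view(-1, 1) - center
    xs = torch.arange(ps).view(1, -1) - center
    mask = (xs ** 2 + ys ** 2 < (ps / 2.0) ** 2).to(patch)
    mag = mag * mask
    # magnitude-weighted orientation histogram; its maximum is the dominant orientation
    bins = ((ori + math.pi) / (2 * math.pi) * num_bins).long().clamp(0, num_bins - 1)
    angles = torch.zeros(b)
    for i in range(b):
        hist = torch.zeros(num_bins, dtype=mag.dtype)
        hist.index_add_(0, bins[i].reshape(-1), mag[i].reshape(-1))
        angles[i] = (hist.argmax().float() + 0.5) * 2 * math.pi / num_bins - math.pi
    return angles
```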
Now we are able to write the functions for geometric normalization of feature point neighborhoods with dominant orientations:
- A, img_idxs = affine_from_location(keypoint_location)
- A, img_idxs = affine_from_location_and_orientation(keypoint_location, orientation)
The outputs A and img_idxs are input arguments for the function extract_affine_patches we implemented in the first lab.
Here's a pseudocode for normalization including orientation estimation:
A, img_idxs = affine_from_location(keypoint_locations)
patches = extract_affine_patches(img, A, img_idxs, 19, 3.0)
ori = estimate_patch_dominant_orientation(patches)
A, img_idxs = affine_from_location_and_orientation(keypoint_locations, ori)
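For illustration, a sketch of how the two functions could build A; the assumed keypoint_location layout (image index, x, y, σ per row) is a guess and may differ from the template:

```python
import torch

def affine_from_location(keypoint_location):
    # keypoint_location: Nx4 tensor, assumed rows (img_idx, x, y, sigma)
    N = keypoint_location.shape[0]
    img_idxs = keypoint_location[:, 0].long()
    x, y, s = keypoint_location[:, 1], keypoint_location[:, 2], keypoint_location[:, 3]
    A = torch.zeros(N, 3, 3, dtype=keypoint_location.dtype)
    A[:, 0, 0] = s                       # isotropic scaling by the detected scale
    A[:, 1, 1] = s
    A[:, 0, 2] = x                       # translation to the keypoint position
    A[:, 1, 2] = y
    A[:, 2, 2] = 1.0
    return A, img_idxs

def affine_from_location_and_orientation(keypoint_location, orientation):
    # compose the similarity frame with an in-plane rotation by the dominant orientation
    A, img_idxs = affine_from_location(keypoint_location)
    ca, sa = torch.cos(orientation), torch.sin(orientation)
    R = torch.zeros_like(A)
    R[:, 0, 0], R[:, 0, 1] = ca, -sa
    R[:, 1, 0], R[:, 1, 1] = sa, ca
    R[:, 2, 2] = 1.0
    return A @ R, img_idxs
```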
(optional study material for interested students)
The detection of similarity-covariant points, as maxima of the Hessian in scale space, can be extended to affine-covariant points up to an unknown rotation. It is based on the so-called Baumberg iteration; the idea is to observe the distribution of gradients in the vicinity of the detected point. Assuming that there is a transformation which “flattens” the intensity in the point's neighborhood (weighted average over the patch) in such a way that the distribution of gradients becomes isotropic (gradients distributed equally in all directions), this transformation can be found from the second moment matrix of gradients, i.e. the autocorrelation matrix of gradients that we already know.
$$ \mathbf{C}(x,y;\sigma_d,\sigma_i)=G(x,y;\sigma_i)*\left[\begin{array}{cc} D^2_{x}(x,y;\sigma_d) & D_x D_y(x,y;\sigma_d)\\ D_x D_y(x,y;\sigma_d) & D^2_y(x,y;\sigma_d) \end{array}\right]$$
The formula above is given for the general case, when the computation is done directly in the image and $\sigma_i, \sigma_d$ correspond to the feature scale. Because in the assignment you are working on the extracted patches, which already account for the feature scale, the windowing with $\sigma_i$ reduces to a simple average over the patch.
As said earlier, this matrix reflects the local distribution of gradients. It can be shown that if we find the transformation $\mathbf{\mu} = \mathbf{C}^{-1/2}$, where $\mathbf{C}$ represents the whole patch (i.e. the per-pixel matrices pooled by the window function), then $\mathbf{\mu}$ maps the coordinate system given by the eigenvectors and eigenvalues of $\mathbf{C}$ to the canonical one, and the distribution of gradients in the transformed patch becomes “more isotropic”. However, since we used a circular window function, the weights of some of the surrounding gradients may be computed incorrectly, or not at all. Therefore, it is necessary to repeat this procedure until the matrix $\mathbf{C}_i$ from the i-th iteration on the point's neighborhood is close to a multiple of the identity, which we check by comparing the ratio of the eigenvalues of $\mathbf{C}_i$. The resulting local affine deformation is obtained as the total deformation of the neighborhood needed to “become” isotropic:
$$N = \prod_i \mu_i$$
This transformation leads (after the addition of translation and scaling) from the image coordinates to the canonical coordinate system. In practice, we are mostly interested in the inverse transformation $\mathbf{A}=\mathbf{N}^{-1}$.
- affshape = estimate_patch_affine_shape(patch) — returns the elements [c11, c12, c22] of the second moment matrix for each patch.
- A, img_idxs = affine_from_location_and_orientation_and_affshape(keypoint_location, orientation, affshape)
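A simplified sketch of the shape estimation (a single pass, returning the pooled second moment matrix as [c11, c12, c22]); a full implementation would iterate the Baumberg procedure and normalize the resulting matrix, which is omitted here:

```python
import torch
import kornia

def estimate_patch_affine_shape(patch):
    # patch: Bx1xPSxPS; pooled second moment matrix of gradients as Bx3 [c11, c12, c22]
    grads = kornia.filters.spatial_gradient(patch, mode='sobel', order=1)
    gx, gy = grads[:, :, 0], grads[:, :, 1]
    c11 = (gx * gx).mean(dim=(1, 2, 3))    # plain average over the patch plays the role of sigma_i
    c12 = (gx * gy).mean(dim=(1, 2, 3))
    c22 = (gy * gy).mean(dim=(1, 2, 3))
    return torch.stack([c11, c12, c22], dim=1)
```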
The goal of photometric normalization is to suppress the scene changes caused by illumination changes. The easiest way of normalization is to stretch the intensity channel to the whole intensity range. We can also transform the intensity channel so that (after the transformation) the mean intensity is at half of the intensity range and the standard deviation of intensities $\sigma$ corresponds to its range (or better, $2\times\sigma$ does). In case we want to use all three color channels, we can normalize each channel separately.
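A sketch of the mean/standard-deviation variant of photometric normalization described above; the target standard deviation is a tunable assumption of this sketch:

```python
import torch

def photonorm(patches, target_mean=0.5, target_std=0.2):
    # patches: BxCxHxW in [0, 1]; per-patch (and per-channel) normalization
    mean = patches.mean(dim=(2, 3), keepdim=True)
    std = patches.std(dim=(2, 3), keepdim=True) + 1e-8
    # shift the mean to the middle of the range, rescale the spread, clip back to [0, 1]
    return ((patches - mean) / std * target_std + target_mean).clamp(0.0, 1.0)
```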
On the geometrically and photometrically invariant image patch from the last step, any low dimensional description can be computed. It is clear that the process of normalization is not perfect and therefore it is desirable to compute a description which is not very sensitive to the residual inaccuracies. The easiest description of the patch is the patch itself. Its disadvantages are high dimensionality and sensitivity to small translations and intensity changes. Therefore, we will try better descriptions too.
Write a function calc_sift_descriptor(input, num_ang_bins, num_spatial_bins, clipval) which computes the SIFT descriptor of a patch.
Image source: VLFeat SIFT tutorial
Here is a description from the original SIFT paper:
A keypoint descriptor is created by first computing the gradient magnitude and orientation (Hint: you can use spatial_gradient_first_order from the first lab) at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. These samples are then accumulated into orientation histograms summarizing the contents over 4×4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region.
The figure above shows a 2×2 descriptor array computed from an 8×8 set of samples, whereas the experiments in this paper use 4×4 descriptors computed from a 16×16 sample array. After this, the descriptor is L2-normalized, its values are clipped to clip_val, and it is L2-normalized again.
Do not forget to weight the magnitudes by their distance to the subpatch center for good performance, as shown below. Also, the subpatches should overlap; see the picture below from https://www.vlfeat.org/api/sift.html.
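To make the steps concrete, here is a heavily simplified sketch of calc_sift_descriptor: non-overlapping subpatches, no bin interpolation, and a single Gaussian weighting towards the patch centre. It follows the text above but is not the reference implementation:

```python
import math
import torch
import torch.nn.functional as F
import kornia

def calc_sift_descriptor(input, num_ang_bins=8, num_spatial_bins=4, clipval=0.2):
    # input: Bx1xPSxPS patches (PS divisible by num_spatial_bins is assumed)
    b, _, ps, _ = input.shape
    grads = kornia.filters.spatial_gradient(input, mode='sobel', order=1)
    gx, gy = grads[:, :, 0], grads[:, :, 1]
    mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    ori = torch.atan2(gy, gx) + math.pi                       # in [0, 2*pi]
    # Gaussian weighting of the magnitudes towards the patch centre
    center = (ps - 1) / 2.0
    ys = torch.arange(ps).view(-1, 1) - center
    xs = torch.arange(ps).view(1, -1) - center
    weight = torch.exp(-(xs ** 2 + ys ** 2) / (2 * (ps / 4.0) ** 2)).to(input)
    mag = mag * weight
    ang_bin = (ori / (2 * math.pi) * num_ang_bins).long().clamp(0, num_ang_bins - 1)
    cell = ps // num_spatial_bins
    desc = torch.zeros(b, num_spatial_bins, num_spatial_bins, num_ang_bins)
    for i in range(num_spatial_bins):
        for j in range(num_spatial_bins):
            m = mag[:, 0, i*cell:(i+1)*cell, j*cell:(j+1)*cell].reshape(b, -1)
            a = ang_bin[:, 0, i*cell:(i+1)*cell, j*cell:(j+1)*cell].reshape(b, -1)
            for k in range(num_ang_bins):
                desc[:, i, j, k] = (m * (a == k).float()).sum(dim=1)
    desc = desc.reshape(b, -1)
    desc = F.normalize(desc, dim=1).clamp(max=clipval)         # L2-normalize, clip to clipval
    return F.normalize(desc, dim=1)                            # and re-normalize
```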
Local descriptors.
To fulfil this assignment, you need to submit these files (all packed in one .zip file) into the upload system:
- local_descriptor.ipynb
- local_descriptor.py — containing the implemented functions: affine_from_location, affine_from_location_and_orientation, affine_from_location_and_orientation_and_affshape, estimate_patch_dominant_orientation, estimate_patch_affine_shape, and calc_sift_descriptor
Python and PyTorch Development
Before we start, we'll need a function keypoint_locations, descs, A = detect_and_describe(img, det, th, affine, PS) connecting the previously implemented functions together, so that it detects, geometrically and photometrically normalizes, and describes feature points in the input image img. You can find the function detect_and_describe in this notebook.
After detection and description, which run on each image of the scene separately, we need to find correspondences between the images. The first phase is looking for tentative correspondences (sometimes called “matches”). A tentative correspondence is a pair of features with descriptions similar in some metric. From now on, we will refer to our images of the scene as the left one and the right one. The easiest way to establish tentative correspondences is to compute all distances between the descriptions of features detected in the left and the right image. We call this table the “distance matrix”. We will compute the distances in Euclidean space. Let's try several methods for correspondence selection:
Nearest neighbor strategy. Features from img1 (blue circles) are matched to features from img2 (red squares). You can see that it is asymmetric and allows “many-to-one” matches.
Mutual nearest neighbor strategy. Features from img1 (blue circles) are matched to features from img2 (red squares). Only cross-consistent matches (green) are retained.
matches_idxs, match_dists=match_snn(descs1, descs2, th)
Second nearest ratio strategy. Features from img1 (blue circles) are matched to features from img2 (red squares). For each point in img1 we find its two nearest neighbors in img2 and check their distance ratio. If the two distances are too similar (ratio > 0.8, bottom of the figure), the match is discarded; only confident matches are kept.
The output of finding tentative correspondences is a set of pairs of indices of features from the left and the right image with the closest descriptions.
The first output will be an Nx2 matrix containing the indices of tentative correspondences, where the first column corresponds to indices in descs1 and the second column to indices in descs2. The second output will be the Nx1 matrix of correspondence qualities: the descriptor distance ratio for match_snn.
Please, make sure to test your code for the corner-cases. Specifically, for situations, where the correct answer for matching would be “there is no match”.
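A compact sketch of match_snn based on a full distance matrix (torch.cdist); the handling of empty inputs shown here is one possible convention, not necessarily the template's:

```python
import torch

def match_snn(descs1, descs2, th=0.8):
    # descs1: N1xD, descs2: N2xD; returns Nx2 indices and Nx1 first-to-second distance ratios
    if descs1.shape[0] == 0 or descs2.shape[0] < 2:
        return torch.empty(0, 2, dtype=torch.long), torch.empty(0, 1)
    dm = torch.cdist(descs1, descs2)                          # Euclidean distance matrix
    vals, idxs = torch.topk(dm, 2, dim=1, largest=False)      # two nearest neighbours per row
    ratio = vals[:, 0] / (vals[:, 1] + 1e-8)
    keep = ratio <= th                                        # discard ambiguous matches
    idx1 = torch.nonzero(keep, as_tuple=False).squeeze(1)
    matches = torch.stack([idx1, idxs[keep, 0]], dim=1)
    return matches, ratio[keep].unsqueeze(1)
```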
Additional reading on matching strategies
The tentative correspondences from the previous step usually contain many non-corresponding points, so-called outliers: tentative correspondences with similar descriptions that do not correspond to the same physical object in the scene. The last phase of finding correspondences aims to get rid of the outliers. We will assume images of the same scene and search for a model of the underlying transformation. The result will be a model of the transformation and a set of so-called inliers, tentative correspondences that are consistent with the model.
The relation between the planes in the scene is discussed in the course Digital Image (look at the document on searching for the projective transformation) or in Geometry for Computer Vision. In short:
The transformation between two images of a plane observed from two views (left and right image) under the perspective camera model is a linear transformation (called a homography) of 3-dimensional homogeneous vectors (of corresponding points in homogeneous coordinates), represented by a regular $3\times 3$ matrix $\mathbf{H}$:
$$ \lambda\left[\!\begin{array}{c}x'\\y'\\1\end{array}\!\right]= \left[\begin{array}{ccc}h_{11}&h_{12}&h_{13}\\h_{21}&h_{22}&h_{23}\\h_{31}&h_{32}&h_{33}\end{array}\right] \left[\!\begin{array}{c}x\\y\\1\end{array}\!\right] $$
For each pair of corresponding points $(x_i,y_i)^{\top} \leftrightarrow (x'_i,y'_i)^{\top}$ it holds:
$$\begin{array}{lcr} x' (h_{31}\,x + h_{32}\,y + h_{33}) - (h_{11}\,x + h_{12}\,y + h_{13}) & = & 0 \\ y' (h_{31}\,x + h_{32}\,y + h_{33}) - (h_{21}\,x + h_{22}\,y + h_{23}) & = & 0 \end{array}$$
Let us build a system of linear constraints for each pair:
$$ \underbrace{\left[\begin{array}{r@{\ }r@{\ }rr@{\ }r@{\ }rr@{\ \ }r@{\ \ }r} -x_1&-y_1&-1&0&0&0&x_1'x_1&x'_1y_1&x_1'\\0&0&0&-x_1&-y_1&-1&y_1'x_1&y'_1y_1&y_1'\\ -x_2&-y_2&-1&0&0&0&x_2'x_2&x'_2y_2&x_2'\\0&0&0&-x_2&-y_2&-1&y_2'x_2&y'_2y_2&y_2'\\ &\vdots&&&\vdots&&&\vdots&\\ -x_n&-y_n&-1&0&0&0&x_n'x_n&x'_ny_n&x_n'\\0&0&0&-x_n&-y_n&-1&y_n'x_n&y'_ny_n&y_n'\\ \end{array}\right]}_\mathbf{C} \left[\begin{array}{c}h_{11}\\h_{12}\\h_{13}\\h_{21}\\h_{22}\\h_{23}\\h_{31}\\h_{32}\\h_{33}\end{array}\right]=0 $$
This homogeneous system of equations has 9 unknowns (the solution is determined only up to scale), and we get a non-trivial solution as the right nullspace of the matrix $\mathbf{C}$. For a unique solution (up to scale) we need at least 8 linearly independent rows of the matrix $\mathbf{C}$, i.e. 8 constraints from 4 corresponding points in general position (no three of the points may lie on a line!).
Implement a function H = getH(min_sample) which estimates the homography from a minimal sample of 4 correspondences given as $$ u = \left[\begin{array}{cccc} x_1 & y_1 & x_1' & y_1' \\ x_2 & y_2 & x_2' & y_2' \\ x_3 & y_3 & x_3' & y_3' \\ x_4 & y_4 & x_4' & y_4' \end{array}\right] $$ Do not forget to normalize by the bottom-right element of the homography matrix, such that $H = [[h_{11}, h_{12}, h_{13}], [h_{21}, h_{22}, h_{23}], [h_{31}, h_{32}, 1]]$. You can use torch.linalg.svd.
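A sketch of getH using the DLT system above and torch.linalg.svd; the row layout of min_sample follows the matrix u:

```python
import torch

def getH(min_sample):
    # min_sample: 4x4 array-like with rows (x, y, x', y')
    rows = []
    for r in min_sample:
        x, y, xp, yp = [float(v) for v in r]
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    C = torch.tensor(rows, dtype=torch.float64)
    _, _, Vt = torch.linalg.svd(C)          # right null space = last row of V^T
    H = Vt[-1].reshape(3, 3)
    return (H / H[2, 2]).float()            # normalize so that h33 = 1
```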
dist = hdist(H, pts_matches) computes, for each tentative correspondence, its distance from the model (the reprojection error under the homography H).
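A possible hdist, assuming pts_matches rows of the form (x, y, x', y') and a one-way transfer error; the exact error measure used by the template may differ:

```python
import torch

def hdist(H, pts_matches):
    # pts_matches: Nx4 tensor (x, y, x', y'); returns Nx1 reprojection distances
    H = H.to(pts_matches.dtype)
    ones = torch.ones(pts_matches.shape[0], 1, dtype=pts_matches.dtype)
    src = torch.cat([pts_matches[:, :2], ones], dim=1)        # Nx3 homogeneous points
    proj = (H @ src.t()).t()                                  # project by the homography
    proj = proj[:, :2] / proj[:, 2:3]                         # back to Euclidean coordinates
    return torch.norm(proj - pts_matches[:, 2:4], dim=1, keepdim=True)
```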
To find a correct model of the scene – a projective transformation between two planes – we need at least 4 inlier point pairs (no three of them collinear). One way of picking such points robustly from all the acquired tentative correspondences is the RANSAC algorithm of M. Fischler and R. C. Bolles.
In our task, the RANSAC algorithm will have this form: in each iteration, draw a random minimal sample of 4 tentative correspondences with sample(pts_matches, num), estimate the homography from it with getH, compute the distances of all tentative correspondences to the model with hdist, and count those below the threshold as inliers. The model with the largest support found so far is kept.
Set the number of iterations to 1000 in the beginning and debug the RANSAC algorithm on a simple pair of images. In practice, a more sophisticated stopping criterion is used; it is based on the probability of finding an all-inlier sample (of size 4 in our case) under an iteratively estimated fraction of inliers. Implement the stopping criterion in the function new_max_iter = nsamples(n_inl, num_tc, sample_size, conf), where n_inl is the number of inliers, num_tc is the number of tentative correspondences, sample_size is 4 and conf is the required confidence: $$ \text{new_max_iter} = \frac{\log(1 - \text{conf})}{\log(1 - \text{inl_ratio}^{\text{sample_size}})}, \qquad \text{inl_ratio} = \frac{\text{n_inl}}{\text{num_tc}} $$
Finally, implement the function Hbest, inl = ransac_h(pts_matches, threshold, confidence, max_iter), which returns the best homography found and the mask of its inliers.
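A minimal sketch of nsamples (a direct implementation of the formula above) and of the ransac_h loop built from sample, getH and hdist; the guard against a degenerate inlier ratio is an assumption added for the sketch:

```python
import math
import torch

def nsamples(n_inl, num_tc, sample_size, conf):
    # adaptive number of RANSAC iterations from the current inlier ratio
    inl_ratio = min(n_inl / num_tc, 1.0 - 1e-9)
    denom = math.log(1.0 - inl_ratio ** sample_size)
    if denom == 0.0:                      # inlier ratio too small for a finite estimate
        return float('inf')
    return math.log(1.0 - conf) / denom

def ransac_h(pts_matches, threshold, confidence, max_iter):
    num_tc = pts_matches.shape[0]
    best_H = torch.eye(3, dtype=pts_matches.dtype)
    best_inl = torch.zeros(num_tc, dtype=torch.bool)
    it = 0
    while it < max_iter:
        minimal_sample = sample(pts_matches, 4)               # 4 random tentative correspondences
        H = getH(minimal_sample)
        inl = hdist(H, pts_matches).reshape(-1) <= threshold
        if inl.sum() > best_inl.sum():
            best_H, best_inl = H, inl
            # shrink the iteration budget according to the current inlier ratio
            max_iter = min(max_iter, nsamples(int(inl.sum()), num_tc, 4, confidence))
        it += 1
    return best_H, best_inl
```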
To fulfil this assignment, you need to submit these files (all packed in one .zip file) into the upload system:
- matching_and_ransac.ipynb
- matching.py
- ransac.py — containing the implemented functions sample, getH, nsamples, and ransac_h
A task related to testing the implemented functions on your own dataset will be added soon; please check for updates.
To test your code, use the provided notebook.
In the tournament, you should implement a function matchImages which takes two images (in the form of torch.Tensor) and outputs the homography transform that maps coordinates in image 1 into image 2. All the parameters are up to the participants. You are free to choose the approach you would like to apply: integrating and tuning what you have done in the previous labs, or developing a completely different algorithm, e.g. based on photometric error minimization. Important: you cannot use a solution from a 3rd party library (by import or by copy-paste), but have to develop it yourself.
The evaluation metric is mean average accuracy. Specifically, the evaluation is done as follows: the corner points (0,0), (0,h), (w,h), (w,0) are reprojected to the second image via the estimated and the ground truth homography, and the average absolute error over the 4 points (MAE) is calculated. See the image below for an illustration. Image credit: Christiansen et al., 2019, UnsuperPoint.
For each accuracy threshold th in [1.0, 2.0, 5.0, 10., 15., 20], the percentage of image pairs with MAE < th is calculated, giving a per-threshold accuracy Ai. Finally, all per-threshold accuracies Ai are averaged into the final score.
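For intuition, a sketch of how such a score could be computed locally; treating the corner error as the mean Euclidean distance of the reprojected corners is one plausible reading of the description above, not the official evaluation code:

```python
import torch

def corner_mae(H_est, H_gt, w, h):
    # reproject the image corners with both homographies and compare
    corners = torch.tensor([[0., 0., 1.], [0., h, 1.], [w, h, 1.], [w, 0., 1.]]).t()  # 3x4
    def project(H):
        p = H.to(corners.dtype) @ corners
        return p[:2] / p[2:3]
    return (project(H_est) - project(H_gt)).norm(dim=0).mean()   # average corner error

def mean_average_accuracy(maes, thresholds=(1.0, 2.0, 5.0, 10.0, 15.0, 20.0)):
    # maes: per-pair corner errors; per-threshold accuracies A_i are averaged into the score
    maes = torch.tensor(maes, dtype=torch.float32)
    accs = torch.stack([(maes < th).float().mean() for th in thresholds])
    return accs.mean()
```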
The number of image pairs in the test dataset: around 40. Timeout: 750 sec. The number of points you get is max(7, (17 - your rank)/2). So, 1st place gets 17 bonus points, 2nd place 16.5, and so on. All participants get at least seven points if their solution produces non-zero accuracy.
If you submit after the deadline, you cannot get the bonus points at all. The regular points (7) are still available, with the standard penalty rules.
For local evaluation you can use the get_MAE_imgcorners function from the provided template. We recommend using the Oxford-Affine dataset for local validation.
Important! Images are loaded with img = kornia.image_to_tensor(cv2.imread(fname,0),False).float()/255., so their range is [0,1.0], not [0,255]
Wide baseline stereo tournament.
To fulfill this assignment, you need to submit these files (all packed in one .zip file) into the upload system:
- wbs.py
- weights.pth
Remember that you are not allowed to use any 3rd party implementations of detectors, descriptors, matchers, RANSACs and so on. However, you may use utility functions like spatial gradient, blurring, and nms.