The goal of this lab is to use deep metric learning to learn an image representation to perform image retrieval. The network architectures and losses from the lecture will be used. We will train a deep CNN to extract global image descriptors by using the triplet loss or the contrastive loss. Our starting point is a ResNet18 network that is pre-trained on ImageNet for classification. We will further train the network, i.e. fine-tune its weights to perform retrieval of bird species on the CUB-200-2011 dataset. The requirement is to implement the network architecture, the loss, and the triplet sampling with hard-negative mining.
Training, validation, and testing are performed on the CUB-200-2011 dataset, comprising images of 200 bird species. In contrast to classification tasks, training, validation, and testing are performed on 3 non-overlapping sets of classes, i.e. testing for retrieval is performed on categories that are not seen during training.
The network architecture consists of a fully convolutional part, a global pooling operation, and vector L2 normalization. The first part maps the image to a 3D activation tensor of size $W\times H \times D$, the second part maps the 3D tensor to a vector of $D$ dimensions, and the third part just normalizes the magnitude of the vector.
Keep the fully convolutional part of ResNet18, with weights that are pre-trained on ImageNet for classification, and add two options for the global pooling operation, i.e. average pooling and max pooling.
Implement the extractor as a class GDextractor, constructed from the pre-trained backbone as model = GDextractor(input_model = resnet18_model, dim = 512, usemax=True), where dim is the descriptor dimensionality $D$ and the boolean usemax selects max pooling instead of average pooling. Implement the forward method so that, for an input image batch img, the global descriptors are obtained with vec = model(img), which is equivalent to vec = model.forward(img).
A triplet consists of an anchor image $a$, an image $p$ that is positive to the anchor, and an image $n$ that is negative to the anchor. Their global descriptors are denoted by $\mathbf{x}_a$, $\mathbf{x}_p$, and $\mathbf{x}_n$, respectively. The triplet loss is formulated by
$$ l(a,p,n) = \max\left(0,\; \lVert \mathbf{x}_a - \mathbf{x}_p \rVert_2^2 - \lVert \mathbf{x}_a - \mathbf{x}_n \rVert_2^2 + \mu\right),$$
where $\mu$ is the margin hyper-parameter.
Implement the loss as a function triplet_loss(distances_pos, distances_neg, margin), where distances_pos and distances_neg hold the anchor-positive and anchor-negative descriptor distances of the triplets in the batch, respectively.
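Given the formula above, the function reduces to a hinge on the distance difference. A sketch, assuming the inputs are tensors of per-triplet distances and that the function returns the loss averaged over the batch (the template may instead return per-triplet values):

```python
import torch

def triplet_loss(distances_pos, distances_neg, margin):
    """Triplet loss: hinge on (positive - negative + margin) distance,
    averaged over all triplets in the batch."""
    return torch.clamp(distances_pos - distances_neg + margin, min=0).mean()
```

For example, with `distances_pos = [0.2, 0.9]`, `distances_neg = [0.5, 0.1]`, and `margin = 0.1`, the first triplet already satisfies the margin (loss 0) while the second contributes 0.9, giving an average of 0.45.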
Mini-batch construction is performed by selecting $M$ triplets of the form $(a,p,n)$, where, for a given anchor $a$, image $p$ is a positive example randomly selected from the same class as $a$, and image $n$ is a hard negative. Image $n$ is randomly sampled from the top-30 nearest neighbors of $a$. The neighbors are estimated over the whole training set by the Euclidean distance between global descriptors of the current network and are updated at the beginning of every epoch.
Implement the hard-negative mining in the method minehard() of the dataset class CUBtriplet. The mined negatives are stored in hardneg, where hardneg[i] is the hard negative selected for the $i$-th training image.
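The mining step can be sketched as a standalone helper (illustrative names, not the template's exact CUBtriplet.minehard signature). One common reading of the rule above is used here: neighbors are restricted to other classes, so the top-30 nearest candidates are all valid negatives.

```python
import torch

def minehard(descriptors, labels, topk=30):
    """For each image, sample a hard negative uniformly among its
    top-k nearest neighbors that belong to a different class."""
    # pairwise Euclidean distances between all global descriptors (N x N)
    dist = torch.cdist(descriptors, descriptors)
    # exclude the image itself and all same-class images
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    dist[same] = float('inf')
    dist.fill_diagonal_(float('inf'))
    # indices of the top-k nearest negatives for every anchor
    knn = dist.topk(topk, largest=False).indices          # N x topk
    # randomly pick one of the k candidates per anchor
    pick = torch.randint(0, topk, (descriptors.size(0),))
    return knn[torch.arange(descriptors.size(0)), pick]   # hardneg[i]
```

In the actual class, the descriptors would be extracted with the current network over the whole training set at the start of each epoch before calling this.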
Training is performed with batches of $M$ triplets, and an epoch consists of using every training image as an anchor exactly once. The optimizer, the training augmentations, and some indicative hyper-parameter values that work for our version are included in the provided code template. Small performance improvements are observed even after the first 1-2 epochs, but larger improvements may require much longer training, which makes the use of a GPU necessary. We measured the time for one epoch to be around 15 minutes on an Intel Xeon Scalable Gold 6150 CPU. Each batch includes: (i) a forward pass to extract global descriptors and estimate the loss, (ii) a backward pass on the average (over triplets) loss, and (iii) a parameter update and setting gradients to zero for the next batch.
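Steps (i)-(iii) can be sketched as a per-epoch loop (assumed helper and loader layout, not the template's exact structure; each batch is assumed to yield anchor, positive, and negative image tensors):

```python
import torch

def train_epoch(model, loader, optimizer, margin):
    """One epoch over batches of M triplets of (anchor, positive, negative)."""
    model.train()
    for anchors, positives, negatives in loader:
        # (i) forward pass: extract global descriptors and estimate the loss
        xa, xp, xn = model(anchors), model(positives), model(negatives)
        d_pos = ((xa - xp) ** 2).sum(dim=1)   # squared Euclidean distances
        d_neg = ((xa - xn) ** 2).sum(dim=1)
        loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()
        # (ii) backward pass on the average (over triplets) loss
        loss.backward()
        # (iii) parameter update and zeroed gradients for the next batch
        optimizer.step()
        optimizer.zero_grad()
```

Averaging before `backward()` keeps the gradient scale independent of the batch size $M$.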
Validation is performed by Euclidean-distance-based retrieval, and the precision at the top-ranked image (precision@1) is measured and reported. Testing is performed on our servers, on a held-out test set, after you submit your model.
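As a sketch of the validation metric (an illustrative helper, not part of the template): precision@1 is the fraction of queries whose nearest neighbor, excluding the query itself, belongs to the same class.

```python
import torch

def precision_at_1(descriptors, labels):
    """Fraction of images whose top-ranked retrieved image
    (nearest neighbor by Euclidean distance) shares their class."""
    dist = torch.cdist(descriptors, descriptors)
    dist.fill_diagonal_(float('inf'))   # never retrieve the query itself
    nn_idx = dist.argmin(dim=1)         # top-ranked image per query
    return (labels[nn_idx] == labels).float().mean().item()
```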
As a reference: SPoC descriptor before any fine-tuning achieves 52.0 precision@1 on the validation set, and 52.0 (coincidentally) precision@1 on the held-out test set.
We provide the overall pipeline in the file dml.py, which can be found here, with some missing parts to be written by you (marked by “your code” in comments). Upload this script after adding your modifications. The CUB dataset can be downloaded from its original source. Additionally, train the network locally with the triplet loss and hard-negative mining, using the hyper-parameter values indicated in the template (or your preferred values if you like), and check that it is improving as expected. The performance of your training implementation will be evaluated on the test set, and part of the credit is assigned if improved performance is detected.