Deep Metric Learning

The goal of this lab is to use deep metric learning to learn an image representation for image retrieval. The network architectures and losses from the lecture will be used. We will train a deep CNN to extract global image descriptors by using the triplet loss or the contrastive loss. Our starting point is a ResNet18 network that is pre-trained on ImageNet for classification. We will further train the network, i.e. fine-tune its weights, to retrieve images of bird species on the CUB-200-2011 dataset. You are required to implement the network architecture, the loss, and the triplet sampling with hard-negative mining.


Training, validation, and testing are performed on the CUB-200-2011 dataset, which comprises images of 200 bird species. In contrast to classification tasks, training, validation, and testing are performed on three non-overlapping sets of classes, i.e. testing for retrieval is performed on categories that are not seen during training.

Network architecture

The network architecture consists of a fully convolutional part, a global pooling operation, and L2 normalization of the resulting vector. The first part maps the image to a 3D activation tensor of size $W\times H \times D$, the second part maps the 3D tensor to a vector of $D$ dimensions, and the third part normalizes the magnitude of the vector to one.

Keep the fully convolutional part of ResNet18, with weights that are pre-trained on ImageNet for classification, and add two options for the global pooling operation, i.e. average pooling and max pooling.

  • Implement a class GDextractor whose constructor is called as model = GDextractor(input_model = resnet18_model, dim = 512, usemax=True), which implements the forward function so as to extract the global descriptor of image img by vec = model(img), which is equivalent to vec = model.forward(img). If usemax is True, the model should extract MAC descriptors with global max pooling, otherwise SPoC descriptors with global average pooling.
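The three-part architecture described above could be sketched as follows. This is a minimal sketch, not the reference solution: it assumes that input_model follows the torchvision ResNet layout, in which the last two child modules are the average-pooling layer and the classification layer, and it only stores dim without using it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GDextractor(nn.Module):
    """Convolutional backbone -> global pooling -> L2 normalization."""

    def __init__(self, input_model, dim=512, usemax=True):
        super().__init__()
        # Assumption: the last two children of input_model are the global
        # pooling and the classifier (true for torchvision ResNets), so
        # dropping them leaves the fully convolutional part.
        self.features = nn.Sequential(*list(input_model.children())[:-2])
        self.dim = dim          # descriptor dimensionality (stored only)
        self.usemax = usemax

    def forward(self, img):
        x = self.features(img)                   # (B, D, H, W) activations
        if self.usemax:
            x = F.adaptive_max_pool2d(x, 1)      # MAC: global max pooling
        else:
            x = F.adaptive_avg_pool2d(x, 1)      # SPoC: global average pooling
        x = x.flatten(1)                         # (B, D)
        return F.normalize(x, p=2, dim=1)        # unit L2 norm per descriptor
```

With torchvision installed, resnet18_model would typically be created by torchvision.models.resnet18 with ImageNet weights; the exact call used in the template is not shown here, so check dml.py.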

Contrastive and triplet loss

A triplet consists of an anchor image $a$, an image $p$ that is positive to the anchor, and an image $n$ that is negative to the anchor. Their global descriptors are denoted by $\mathbf{x}_a$, $\mathbf{x}_p$, and $\mathbf{x}_n$, respectively. The triplet loss is formulated by

$$ l(a,p,n) = \lVert \mathbf{x}_a - \mathbf{x}_p \rVert_2^2 - \lVert \mathbf{x}_a - \mathbf{x}_n \rVert_2^2 + \mu,$$

where $\mu$ is the margin hyper-parameter.

  • Implement a function triplet_loss(distances_pos, distances_neg, margin) that takes as input the squared Euclidean distances between anchor and positive and between anchor and negative, given as the $M$-dimensional vectors distances_pos and distances_neg, respectively, where $M$ is the number of triplets in the batch.
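A minimal sketch of such a function is given below, under two assumptions that are not stated in the formula above: the per-triplet value is clamped at zero (the common hinge form of the triplet loss), and the function returns the average over the $M$ triplets. Remove the clamp to follow the formula literally.

```python
import torch


def triplet_loss(distances_pos, distances_neg, margin):
    """Mean triplet loss over a batch of M triplets.

    distances_pos, distances_neg: (M,) tensors of squared Euclidean distances
    between anchor-positive and anchor-negative descriptors.
    """
    per_triplet = distances_pos - distances_neg + margin
    # Assumption: hinge at zero, so triplets that already satisfy the margin
    # contribute neither loss nor gradient.
    per_triplet = torch.clamp(per_triplet, min=0)
    return per_triplet.mean()
```

For example, with distances_pos = [0.2, 1.0], distances_neg = [1.0, 0.5] and margin 0.5, the first triplet already satisfies the margin and contributes 0, the second contributes 1.0, and the mean is 0.5.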

Triplet sampling with hard-negative mining

Mini-batch construction is performed by selecting $M$ triplets of the form $(a,p,n)$, where for a given anchor $a$, image $p$ is a positive example that is randomly selected from the same class as $a$, and image $n$ is a hard negative. Image $n$ is randomly sampled from the top-30 nearest neighbors of $a$. The neighbors are estimated over the whole training set by the Euclidean distance between global descriptors extracted with the current network, and they are updated at the beginning of every epoch.

  • Implement a function minehard() within the provided CUBtriplet class. Its result should be stored in class member hardneg, where hardneg[i] indicates the index of the hard-negative chosen for the i-th training image.
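The mining step could look roughly like the standalone function below. It is a sketch that does not use the CUBtriplet class, whose internals are not shown here; the function name, its arguments, and the choice to take the top-30 neighbors among images of other classes only are assumptions. It also assumes that every anchor has at least one image from another class.

```python
import torch


def mine_hard_negatives(descriptors, labels, topk=30, generator=None):
    """Pick one hard negative per image.

    descriptors: (N, D) float tensor of global descriptors (current network).
    labels:      (N,) integer tensor of class labels.
    Returns an (N,) long tensor; entry i is the index of the chosen negative.
    """
    dist = torch.cdist(descriptors, descriptors)            # (N, N) Euclidean
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Same-class images (including the anchor itself) cannot be negatives.
    dist = dist.masked_fill(same_class, float("inf"))
    k = min(topk, descriptors.shape[0] - 1)
    values, nearest = dist.topk(k, dim=1, largest=False)    # k closest negatives
    # If an anchor has fewer than k negatives, the masked entries get weight 0.
    weights = torch.isfinite(values).float()
    pick = torch.multinomial(weights, 1, generator=generator)  # (N, 1)
    return nearest.gather(1, pick).squeeze(1)
```

Inside minehard(), the result of such a call would be stored in hardneg after extracting descriptors for all training images with the current weights, without tracking gradients.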


Training is performed with batches of $M$ triplets, and an epoch is defined as using every training image as an anchor exactly once. The optimizer, the training augmentations, and some indicative hyper-parameter values that work for our version are included in the provided code template. Some small performance improvements are observed even after the first 1-2 epochs, but for larger improvements you might need to train much longer, which makes the use of a GPU necessary. We measured the time for one epoch to be around 15 minutes on an Intel Xeon Scalable Gold 6150 CPU. Each batch includes: (i) a forward pass to extract global descriptors and compute the loss, (ii) a backward pass on the loss averaged over the triplets, and (iii) a parameter update, followed by setting the gradients to zero for the next batch.
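Steps (i) to (iii) can be sketched as a single function. The name train_step and its arguments are hypothetical and do not come from the template; the loss uses the hinge form of the triplet loss, which is an assumption, and the batch is assumed to arrive as three image tensors of equal size.

```python
import torch


def train_step(model, optimizer, anchors, positives, negatives, margin):
    """Run one batch of M triplets and return the batch loss as a float."""
    # (i) Forward pass: descriptors, squared distances, and the loss.
    xa, xp, xn = model(anchors), model(positives), model(negatives)
    d_pos = (xa - xp).pow(2).sum(dim=1)     # (M,) anchor-positive
    d_neg = (xa - xn).pow(2).sum(dim=1)     # (M,) anchor-negative
    # Assumption: hinge at zero, then average over the M triplets.
    loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()

    # (ii) Backward pass on the averaged loss.
    loss.backward()

    # (iii) Parameter update, then clear gradients for the next batch.
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Clearing the gradients at the end of the step, rather than at the start, matches the order listed above; either placement works as long as gradients from one batch do not leak into the next.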


Validation is performed by Euclidean-distance-based retrieval, and the precision at the top-ranked image (precision@1) is measured and reported. Testing is performed on our servers after your model submission, on a held-out test set.
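One way to compute this measure is sketched below. It assumes that every validation image serves as a query against all other validation images and that the score is reported in percent; the template may organize the evaluation differently.

```python
import torch


def precision_at_1(descriptors, labels):
    """Percentage of queries whose nearest neighbor has the same label.

    descriptors: (N, D) float tensor, labels: (N,) integer tensor.
    """
    dist = torch.cdist(descriptors, descriptors)   # (N, N) Euclidean distances
    dist.fill_diagonal_(float("inf"))              # a query cannot retrieve itself
    nearest = dist.argmin(dim=1)                   # top-ranked image per query
    correct = labels[nearest] == labels
    return correct.float().mean().item() * 100.0
```

Since the descriptors are L2-normalized, ranking by Euclidean distance gives the same order as ranking by descending dot product.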

As a reference: SPoC descriptor before any fine-tuning achieves 52.0 precision@1 on the validation set, and 52.0 (coincidentally) precision@1 on the held-out test set.

What you should upload

We provide the overall pipeline in the file dml.py, which can be found here, with some missing parts to be written by you (marked by “your code” in comments). Upload this script after adding your modifications. The CUB dataset can be downloaded from its original source. Additionally, train the network locally with the triplet loss and hard-negative mining, using the hyper-parameter values indicated in the template (or your preferred values), and check that the performance improves. Your training implementation will be evaluated on the test set, and part of the credit is assigned if improved performance is detected.

Possible extras for further understanding (not for bonus points)

  • Try different values for the margin (e.g. 0 or 0.8) and observe how they affect training and the validation performance.
  • Try the contrastive loss, which should be straightforward to add to the current implementation.
  • What happens if you do not use hard-negatives but pick a random negative instead?
  • The dimensionality of the representation is fixed (512D) and depends on the ResNet18 architecture. To learn lower-dimensional descriptors, you can add a fully connected layer right after the global pooling to reduce the dimensionality (e.g. from 512 to 64).
  • What happens if you do not use any random image augmentations during the training?
  • Try to train for much longer (on a GPU). Can a scheduler for the learning rate help?
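For the dimensionality-reduction extra in the list above, a fully connected layer after the global pooling could be sketched as follows. The class name ReductionHead is hypothetical, and applying L2 normalization after the projection (so that distances stay comparable to the 512D case) is an assumption, not a stated requirement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReductionHead(nn.Module):
    """Projects pooled D-dimensional vectors to a lower dimension."""

    def __init__(self, in_dim=512, out_dim=64):
        super().__init__()
        # Randomly initialized, so it must be trained with the rest of the net.
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, pooled):
        # pooled: (B, in_dim) vectors taken right after global pooling.
        reduced = self.fc(pooled)                   # (B, out_dim)
        # Assumption: re-normalize so descriptors keep unit L2 norm.
        return F.normalize(reduced, p=2, dim=1)
```

Because this layer is not pre-trained, retrieval performance before any fine-tuning is expected to differ from the 512D reference numbers.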
courses/mpv/labs/deep_metric_learning.txt · Last modified: 2023/05/12 09:50 by sumapave