Search
Quick links: Schedule | Forum | BRUTE | Lectures | Labs
Image retrieval, mean average precision, training embeddings (representations) with triplet loss and smoothAP loss.
Use the provided template. The package contains:
tools.py
lab.py
view.ipynb
./models/*
In this lab we want to use a neural network to produce an embedding of images $x$ as feature vectors $f \in \mathbb{R}^d$. These feature vectors can be used then to retrieve similar images by finding the closest neighbors in the embedding space.
We will use the FashionMNIST dataset and start with a network trained for classification. We will use its last hidden layer representation as the embedding. The network is a small convolutional network defined as follows
class ConvNet(nn.Sequential): def __init__(self, num_classes: int = 10) -> None: layers = [] layers += [nn.Conv2d(1, 32, kernel_size=3)] layers += [nn.ReLU(inplace=True)] layers += [nn.MaxPool2d(kernel_size=2, stride=2)] layers += [nn.Conv2d(32, 32, kernel_size=3)] layers += [nn.ReLU(inplace=True)] layers += [nn.MaxPool2d(kernel_size=2, stride=2)] layers += [nn.Conv2d(32, 64, kernel_size=3)] layers += [nn.ReLU(inplace=True)] layers += [nn.AdaptiveAvgPool2d((2, 2))] layers += [nn.Flatten()] layers += [nn.Linear(64 * 2 * 2, num_classes)] super().__init__(*layers) self.layers = layers def features(self, x): f = nn.Sequential(*self.layers[:-1]).forward(x) f = nn.functional.normalize(f, p=2, dim=1) return f
features(x)
distances
torch.cdist
torch.einsum
evaluate_AP
evaluate_mAP(net, dataset)
Report and discuss: method used to compute all Euclidean distances, figure with retrieved images from 1, precision-recall curve and mAP.
In this part we will learn the embedding to directly facilitate the retrieval performance by utilizing specialized loss functions. We will use exactly the same network architecture as in Part 1 but we will train it differently.
For an anchor (query) image $a$ let the positive example $p$ be of the same class and a negative example $n$ be of different class than $a$. For conciseness of notation we will define all the quantities and losses for a fixed anchor $a$. The losses are then to be considered in the expectation over $a$.
Let $d_{p} = d(f(a),f(p))$ and $d_{n} = d(f(a),f(n))$ be distances between positive and anchor, and negative and anchor, respectively. We want to learn the embedding $f$ such that all positive examples $p$ would be closer to the anchor than all negative examples: \begin{align}\label{triplet-constraint} d_{p} < d_{n} \ \ \ \forall p,n. \end{align}
The constraint \eqref{triplet-constraint} is violated if $d_{p} - d_{n} \geq 0$. We can use a hinge loss to penalize violation and add a margin to ensure that positives are strictly closer: \begin{align} l(a) = \sum_{p,n} \max(d_{p} - d_{n} + \alpha, 0). \end{align} Assume that we use squared euclidean distance for $d$, it has range $[0,4]$ and so the maximum violation of the constraint is $4$. A reasonable range for choosing the margin $\alpha$ is $[0,0.5]$.
triplet_loss
train_triplets
If our task is like the retrieval task, where mAP was proposed as a criterion to asses the quality of the image retrieval, we can design a loss function more suitable in order to improve mAP criterion directly.
In computing Mean Average Precision (mAP) we compute the average precision (AP) for a given query (anchor) and average over all queries. Therefore, we will again assume that $a$ is fixed, design approximations to AP and consider them in average.
Let $P$ be the set of all positive examples and $N$ the set of all negative examples for the anchor $a$. It can be shown that the average precision can be expressed as \begin{align}\label{AP-pairs} {\rm AP} = 1 - \frac{1}{T} \sum_{p\in P} \frac{\sum_{n\in N} [[d_n < d_p]]}{k(p)}, \end{align}
where $T = |P|$ is the total number of positive examples and \begin{align}\label{k(p)} k(p) = \sum_{x \in P \cup N} [[d_x \leq d_p]] \end{align}
In the expression \eqref{AP-pairs} the numerator counts the number of negative examples which have a smaller distance to the query than a positive example $p$, i.e. they will be incorrectly listed in a sorted list of retrieved items earlier than $p$. The function $k(p)$ expresses the position of $p$ in the sorted list of all examples. Efficiently $1/k(p)$ gives a higher relative weights to errors in the beginning of the retrieval list and discounts errors towards the end of the retrieval list.
We would wish to maximize mAP, therefore the equivalent loss to minimize is the average over positive examples \begin{align} l(a) = \frac{1}{T} \sum_{p\in P} \frac{\sum_{n\in N} [[d_n < d_p]]}{k(p)}, \end{align}
Since $l(a)$ it is composed of non-differentiable indicator functions, we cannot train the embedding $f$ to minimize it directly. We need to approximate (relax) the criterion to something differentiable.
A natural relaxation is to use sigmoid to smoothly approximate the step function. Namely \begin{align} \sigma_\tau(x) = \frac{1}{1 + e^{-x/\tau}} \end{align} approaches $[[x \geq 0]]$ when $\tau \rightarrow 0$. Using this relaxation we obtain a variant of the “Smooth AP” loss \begin{align}\label{AP-sigmoid} l_\text{Smooth-AP}(a) = \sum_{p\in P} \frac{\sum_{n\in N} \sigma_\tau(d_{p} - d_{n})}{k_{\sigma}(p)}, \end{align}
where
\begin{align} k_{\sigma}(p) = \sum_{x \in P \cup N} \sigma_\tau(d_{p} - d_{x}). \end{align}
Applying smoothing to all indicator functions in \eqref{AP-pairs}, including those occurring in $k(p)$ results in the method of A. Brown et al.: Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval (2020).
smooth_AP_loss
train_smooth_AP
Report and discuss: Optimization settings, plot of the training progress using the saved history. Evaluation of the trained network as in Part 1 (examples of retrieval and mAP score). Precision-recall curve in comparison with that one for the classification-trained network (display in the same plot). The reference solution has ${\rm mAP=0.79}$ and ${\rm mAP=0.81}$ and the following retrieved images for the exemplar queries for triplet and smoothAP loss, respectively.
The reference precision-recall comparison: