Search
Image retrieval, mean average precision, training embeddings (representations) with triplet loss.
Use the provided template. The package contains:
tools.py
triplet.py
view.ipynb
./models/*
In this lab we want to use a neural network to produce an embedding of images $x$ as feature vectors $f \in \mathbb{R}^d$. These feature vectors can then be used to retrieve similar images by finding the closest neighbors in the embedding space.
We will use the MNIST dataset and start with a network trained for classification, using its last hidden layer representation as the embedding. The network is a small convolutional network defined as follows:
```python
class ConvNet(nn.Sequential):
    def __init__(self, num_classes: int = 10) -> None:
        layers = []
        layers += [nn.Conv2d(1, 32, kernel_size=3)]
        layers += [nn.ReLU(inplace=True)]
        layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        layers += [nn.Conv2d(32, 32, kernel_size=3)]
        layers += [nn.ReLU(inplace=True)]
        layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        layers += [nn.Conv2d(32, 64, kernel_size=3)]
        layers += [nn.ReLU(inplace=True)]
        layers += [nn.AdaptiveAvgPool2d((2, 2))]
        layers += [nn.Flatten()]
        layers += [nn.Linear(64 * 2 * 2, num_classes)]
        super().__init__(*layers)
        self.layers = layers

    def features(self, x):
        f = nn.Sequential(*self.layers[:-1]).forward(x)
        f = nn.functional.normalize(f, p=2, dim=1)
        return f
```
The embedding is computed by `features(x)`, which applies all layers except the final classifier and L2-normalizes the result.
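Embeddings for the whole dataset can then be collected batch by batch. The following sketch is a hypothetical helper (the names `embed_dataset`, `net`, and `loader` are not part of the template); it assumes `net` has a `features` method as defined above:

```python
import torch

def embed_dataset(net, loader):
    """Collect embeddings and labels for all images yielded by `loader`."""
    net.eval()  # disable dropout/batch-norm updates during inference
    feats, labels = [], []
    with torch.no_grad():  # no gradients needed for retrieval
        for x, y in loader:
            feats.append(net.features(x))
            labels.append(y)
    return torch.cat(feats), torch.cat(labels)
```

The returned feature matrix can be queried directly for nearest neighbors in the embedding space.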
The Mean Average Precision (mAP) is the mean of $AP$ over random queries. Your task is to implement a function evaluate_AP which computes the average precision given distances to all other data points and their labels. Using this function, implement evaluate_mAP(net, dataset) which computes the mean average precision using the first 100 (random) dataset points as queries, as provided in the template. When computing the average precision, the query itself should be excluded from the set of retrievable items. The expected result for the classification network is ${\rm mAP} = 0.74$.
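A minimal sketch of the average precision computation follows. The exact signature in the template may differ; here `dist` holds the query's distances to all retrievable items (the query itself already excluded), `labels` their class labels, and `query_label` the class of the query:

```python
import numpy as np

def evaluate_AP(dist, labels, query_label):
    """Average precision for one query, given distances to all other points."""
    order = np.argsort(dist)                  # retrieve in order of increasing distance
    rel = (labels[order] == query_label)      # relevance indicator of each retrieved item
    if rel.sum() == 0:
        return 0.0                            # no relevant items at all
    ranks = np.arange(1, len(rel) + 1)
    precision = np.cumsum(rel) / ranks        # precision at each rank
    # average precision = mean of precision at the ranks of relevant items
    return float((precision * rel).sum() / rel.sum())
```

`evaluate_mAP` then averages this quantity over the 100 query points.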
Report: the method used to compute all Euclidean distances, the figure with retrieved images from task 1, and the mAP computed in task 2.
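One standard way to compute all pairwise Euclidean distances without Python loops is the expansion $\|q - f\|^2 = \|q\|^2 - 2\,q^\top f + \|f\|^2$. The sketch below is one possible approach, not necessarily the template's; the function name is hypothetical:

```python
import torch

def pairwise_sq_dists(Q, F):
    """Squared Euclidean distances between rows of Q (m x d) and F (n x d)."""
    d2 = (Q * Q).sum(1, keepdim=True) - 2 * Q @ F.T + (F * F).sum(1)
    return d2.clamp_min(0)  # clamp tiny negatives caused by floating-point rounding

# torch.cdist(Q, F) ** 2 computes the same matrix
```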
In this part we will learn an embedding that directly optimizes retrieval performance by training with the triplet loss. We will use exactly the same network architecture as in Part 1 but train it differently. For an anchor (query) image $a$, let a positive example $p$ be of the same class as $a$ and a negative example $n$ be of a different class. We want each positive example to be closer to the anchor than all negative examples by a margin $\alpha \geq 0$:
$$ d(f(a),f(p)) \leq d(f(a),f(n)) - \alpha \quad \forall p,n.$$
The constraint is violated if $d(f(a),f(p)) - d(f(a),f(n)) + \alpha > 0$. Accordingly, we define the triplet loss as the total violation of these constraints:
$$ l(a) = \sum_{p,n} \max(d(f(a),f(p)) - d(f(a),f(n)) + \alpha,\, 0).$$
For consistency with the lecture, let us define $d(x,y) = \| x-y\|_2^2$. We need to incorporate this loss into stochastic optimization with mini-batches. Your task is as follows:
- `triplet_loss`
- `train_triplets`
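A minimal sketch of the loss on a mini-batch of pre-formed triplets follows. The template may form triplets differently (e.g. enumerating all positive/negative pairs within a batch), and the argument names here are assumptions; the sketch only illustrates the hinge with $d(x,y) = \|x-y\|_2^2$:

```python
import torch

def triplet_loss(f_a, f_p, f_n, alpha=0.5):
    """Triplet loss over a batch, using squared Euclidean distance.

    f_a, f_p, f_n: (B x d) embeddings of anchors, positives, negatives.
    """
    d_ap = ((f_a - f_p) ** 2).sum(dim=1)  # squared distance anchor-positive
    d_an = ((f_a - f_n) ** 2).sum(dim=1)  # squared distance anchor-negative
    # hinge: penalize triplets where the margin constraint is violated
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```

Since the loss is built from differentiable tensor operations, it can be minimized with any standard optimizer inside the usual training loop.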