Lab 4: CNN visualization & adversarial patterns

CNN deep features visualization. Attention maps. Adversarial patterns and attacks.

Introduction

In this lab we will consider a CNN classifier and visualize activations and attention maps for its hidden layers. We will look for input patterns that maximize the activations of specific neurons and see how to craft adversarial attacks that fool the network. All of these tasks rely on very similar techniques. We recommend using Jupyter notebooks for this lab, as the computations are relatively light and we need a lot of visualization.

Setup

Model

In this lab we will use the pre-trained VGG11 CNN, which you already know from the previous lab. Load it as follows:

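import torch

# use the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# load the pre-trained network and switch it to evaluation mode
net = torch.hub.load('pytorch/vision:v0.9.0', 'vgg11', pretrained=True)
net = net.eval().to(device)

# we are not changing the network weights/biases in this lab
for param in net.parameters():
    param.requires_grad = False
print(net)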

Data

For this lab we need just a few images from ImageNet. We provide an image of a labrador retriever. Choose one or two additional images of your own that demonstrate the effects studied below well. We also need the class codes for the 1000 categories in ImageNet, which we provide in the text file imagenet_classes.txt.

You will need to set up the standard image transformation pipeline for ImageNet:

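from torchvision import transforms

# image to tensor transform
transform = transforms.Compose([
    transforms.Resize(224),      # resize the shorter image side to 224 pixels
    transforms.ToTensor(),       # PIL image -> float tensor with values in [0, 1]
    transforms.Normalize(        # per-channel ImageNet statistics
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )])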
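
To display a normalised tensor (or, later, an optimised input pattern) as an image, the normalisation has to be undone. The following is a minimal sketch of such an inverse, assuming the mean and std values from the pipeline above; the helper name `denormalize` is hypothetical and not part of the provided material.

```python
import torch

# same per-channel statistics as in transforms.Normalize above
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def denormalize(t):
    """Map a normalised (3, H, W) or (N, 3, H, W) tensor back to RGB values in [0, 1]."""
    out = t.detach().cpu() * STD + MEAN   # invert (x - mean) / std
    return out.clamp(0.0, 1.0)            # guard against values outside the displayable range

# round trip on a random "image" with values in [0, 1]
img = torch.rand(3, 4, 4)
normalised = (img - MEAN) / STD
restored = denormalize(normalised)
```

For plotting with matplotlib, the result would still need `permute(1, 2, 0)` to move the channel dimension last.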

Part 1: Features, Activations and Gradients Visualisation (3p)

We will first try to understand the trained network by visualising the learned representation:

Load the image(s), apply the classifier and report the top ten classes and their a posteriori probabilities to test the setup. Visualise the input image and the transformed (resampled & normalised) tensor image.
Display the weights of the first convolutional layer as images in an 8 x 8 grid (VGG has 64 channels in the first layer).
Given an input image, compute the maximum feature response map for each of the 21 feature layers ($l=0,\ldots,20$), i.e. for each spatial position compute the max over the channel activations. Display the resulting maps in a tableau. Use the layer names (layer.__class__.__name__) and numbers as titles.
Next, compute and visualise the “network attention”: the gradient of the network classification score for the predicted class w.r.t. the intermediate feature maps $l=0,\ldots,20$. You could achieve this by forward iterating through the feature layers as above, additionally calling x.retain_grad() and appending each $x$ to a list. Since the network weights are frozen, the input image itself must require gradients (e.g. via requires_grad_()), otherwise the intermediate tensors carry no gradient information. Then forward propagate through the rest of the network to its final output (score) and apply .backward() to compute the gradient w.r.t. the feature maps. Display a tableau with the results by showing the $\ell_2$ norms of the channel gradients (per pixel).
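
The last two tasks can be sketched as follows. This is only an illustration under stated assumptions: `features` and `head` are small stand-ins for the feature layers and the rest of the network (not VGG11), and the function name `activations_and_attention` is hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# small stand-in for the feature layers and the rest of the network (NOT VGG11)
features = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
for p in list(features.parameters()) + list(head.parameters()):
    p.requires_grad = False          # weights are frozen, as in the lab setup

def activations_and_attention(features, head, img):
    # the input must require gradients, otherwise nothing downstream does
    x = img.clone().requires_grad_(True)
    maps = []
    for layer in features:
        x = layer(x)
        x.retain_grad()              # keep the gradient of this non-leaf tensor
        maps.append(x)
    scores = head(x)
    cls = int(scores[0].argmax())    # predicted class
    scores[0, cls].backward()        # gradient of the predicted-class score
    # per-pixel max over channels and per-pixel l2 norm of the channel gradients
    act_max = [m.detach()[0].amax(dim=0) for m in maps]
    attention = [m.grad[0].pow(2).sum(dim=0).sqrt() for m in maps]
    return act_max, attention, cls

img = torch.randn(1, 3, 8, 8)
act_max, attention, cls = activations_and_attention(features, head, img)
```

Each entry of `act_max` and `attention` is a 2D map that can be passed to `imshow`. One caveat when moving to the real network: torchvision's VGG uses in-place ReLU modules, so the tensor stored for a convolution is overwritten by the ReLU that follows it; storing a detached clone of the activations right after each layer is one way to keep both.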

What conclusions can you draw from these visualisations? How informative are the outputs? Can you think of a better visualisation? If you have spare time, you may also try some of the GradCAM visualization variants.

Part 2: Hubel & Wiesel like experiments (4p)

Next, we will try to find input patterns that maximise the outputs of a particular neuron in a given layer (recall the work of Hubel and Wiesel: part 1, part 2). Choose one of the following methods (or do both if you like):

Optimize over the input, to find a patch which maximizes the output of the neuron.
Iterate over a dataset (e.g. ImageNet or any other dataset with sufficiently rich appearance) and find the top 10 activation-maximizing patches.

Experiment with different layers and channels and try to find interesting relations to the input.

We will ask you the same question at the end, but without knowing much yet, which approach do you think will produce better insight into the learned representation?

For both methods, you will need to implement a function receptive_field that computes the size of the receptive field for a given layer (see seminar 2). Use this function to find the relevant patch size.
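
As a reminder of the underlying recursion, here is a minimal sketch of such a function. The signature is hypothetical (the version from the seminar may look different), and the VGG11 layer list is written out by hand as (kernel size, stride) pairs; in practice you would read these values from the modules themselves.

```python
def receptive_field(layers, upto):
    """Receptive field size (in input pixels) of one unit after layers[0..upto].

    layers: list of (kernel_size, stride) pairs; element-wise layers such as
    ReLU are written as (1, 1). Padding does not change the size.
    """
    size, jump = 1, 1                  # field size and spacing of units, in input pixels
    for kernel, stride in layers[:upto + 1]:
        size += (kernel - 1) * jump    # each extra kernel tap adds `jump` input pixels
        jump *= stride                 # striding spreads neighbouring units apart
    return size

# VGG11 feature layers by hand: conv 3x3 stride 1, ReLU, max-pool 2x2 stride 2
CONV, RELU, POOL = (3, 1), (1, 1), (2, 2)
vgg11_layers = [CONV, RELU, POOL,
                CONV, RELU, POOL,
                CONV, RELU, CONV, RELU, POOL,
                CONV, RELU, CONV, RELU, POOL,
                CONV, RELU, CONV, RELU, POOL]

sizes = [receptive_field(vgg11_layers, l) for l in range(len(vgg11_layers))]
```

The receptive field grows quickly with depth, which is why the patch size (and the memory needed for the optimisation below) differs so much between early and late layers.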

As the first approach is a little technical, we provide the following basic template algorithm:

Start from an image $x$ of the size of the receptive field, initialized with zeros. You will need to make it a trainable parameter in order to compute the gradients of the loss with respect to this input.
Forward propagate $x$ through the network to compute the feature map $y$ of the target layer.
Select the centrally located pixel and the target channel in the feature map $y$.
Use the Adam optimizer to maximize the selected feature (i.e. a forward-backward loop with optimizer steps). Since the optimizer minimises its loss, use the negated activation as the loss.
Constrain the search to patterns with all components in the range $[-1.0, 1.0]$. You can achieve this simply by clipping the pattern after each gradient step.
Run the gradient ascent for a fixed number of steps.
Find such an activating image $x$ for each channel of the target layer and display them in a panel. Stretch the value range to [x.min, x.max] or use the inverse of the above ImageNet normalization.
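
The template above can be sketched for a single channel as follows. This is an illustration only: `features` is a toy stand-in (not VGG11) whose layer 2 has a 5 x 5 receptive field, its biases are set to a positive constant purely so that the ReLUs are active at the zero image, and the function name `maximize_activation` and the step count and learning rate are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy stand-in for the first layers of a CNN (NOT VGG11)
features = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 4, 3))
for m in features:
    if isinstance(m, nn.Conv2d):
        nn.init.constant_(m.bias, 0.1)   # toy choice: ReLUs stay active at the zero image
for p in features.parameters():
    p.requires_grad = False              # only the input pattern is optimised

def maximize_activation(features, layer, channel, S, steps=100, lr=0.01):
    x = torch.zeros(1, 3, S, S, requires_grad=True)     # trainable input patch
    opt = torch.optim.Adam([x], lr=lr)
    history = []
    for _ in range(steps):
        opt.zero_grad()
        y = features[:layer + 1](x)                      # feature map of the target layer
        act = y[0, channel, y.shape[2] // 2, y.shape[3] // 2]   # central pixel
        history.append(act.item())
        (-act).backward()                                # Adam minimises, so negate
        opt.step()
        with torch.no_grad():
            x.clamp_(-1.0, 1.0)                          # keep the pattern in [-1, 1]
    return x.detach(), history

pattern, history = maximize_activation(features, layer=2, channel=0, S=5)
```

Here `history` records the activation before each step, so plotting it gives a quick check that the ascent is making progress.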

You can speed up the optimization by running it in parallel for all layer channels. This can be achieved by using a batch of activation images (one per target channel) along with the following “trick”:

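x = torch.nn.Parameter(torch.zeros(channels, 3, S, S, device=device))

initialises a trainable zero tensor, where S is the size of the receptive field. The tensor is created directly on the target device so that it remains a leaf tensor that the optimizer can update. If $f$ denotes the output of the considered layer, then the objective is simply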

f[:,:,sz[2]//2, sz[3]//2].diag().sum()

where sz is the shape of $f$. However, for later layers, this approach may run out of GPU memory, so you will probably need to resort to an iterative approach.
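
To see what the objective selects, consider the following small check. It uses a random tensor as a stand-in for the layer output $f$ (an assumption for illustration only): entry f[i, c] is the response of channel c to patch i, and the diagonal keeps only the pairs where patch i is matched with channel i.

```python
import torch

torch.manual_seed(0)
channels, H, W = 4, 3, 3

# stand-in for the layer output on a batch of `channels` patches
f = torch.randn(channels, channels, H, W)
sz = f.shape

center = f[:, :, sz[2] // 2, sz[3] // 2]   # (channels, channels): patch i, channel c
objective = center.diag().sum()            # keeps only patch i <-> channel i

# the same quantity written as an explicit loop
loop = sum(f[i, i, sz[2] // 2, sz[3] // 2] for i in range(channels))
```

Because the patches in a batch do not interact, the gradient of this sum with respect to patch i depends only on channel i, so all channels are optimised independently in one pass. The price is that every patch is propagated through all channels, which is where the memory cost for later layers comes from.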

You will most likely arrive at patches not resembling patches of natural images. Could you explain what is happening?

One way to enforce more realistic patterns is to add smoothness regularisation (natural patches are smoother on average). Let $x_c$ denote a colour channel of the input pattern and $A$ denote a smoothing convolution. We want to enforce the constraint $$\lVert x_c - A x_c\rVert_1 \leq \epsilon\;.$$ For this we will simply replace $x_c$ by $Ax_c$ after each iteration if the constraint is violated. This can be done with the following code snippet:

with torch.no_grad():
  xx = apool(apad(x))
  diff = x - xx
  dn = torch.linalg.norm(diff.flatten(2), dim=2, ord=1.0) / (S * S)
  if dn.max() > epsilon:
    x.data[dn > epsilon] = xx[dn > epsilon]

where

apool = torch.nn.AvgPool2d(3, padding=0, stride=1)
apad = torch.nn.ReplicationPad2d(1)
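
The effect of this step can be checked on two extreme patches. The sketch below wraps the same logic in a hypothetical helper `smooth_step` and applies it to a constant patch and a checkerboard; the value of epsilon is an arbitrary choice for the demonstration.

```python
import torch

apool = torch.nn.AvgPool2d(3, padding=0, stride=1)
apad = torch.nn.ReplicationPad2d(1)

def smooth_step(x, epsilon):
    """Replace every colour channel whose mean |x - Ax| exceeds epsilon by Ax (in place)."""
    with torch.no_grad():
        xx = apool(apad(x))                       # A x, same shape as x
        dn = (x - xx).flatten(2).abs().sum(dim=2) / (x.shape[2] * x.shape[3])
        mask = dn > epsilon                       # (batch, 3) boolean mask
        x[mask] = xx[mask]
    return x

S = 6
board = ((torch.arange(S)[:, None] + torch.arange(S)[None, :]) % 2) * 2.0 - 1.0
x = torch.stack([torch.full((3, S, S), 0.5),      # perfectly smooth patch
                 board.expand(3, S, S)])          # maximally rough checkerboard
smoothed = apool(apad(x))
out = smooth_step(x.clone(), epsilon=0.05)
```

The constant patch satisfies the constraint and is left untouched, while every channel of the checkerboard is replaced by its smoothed version.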

Tune $\epsilon$ so that the optimal activation patterns resemble natural image patches and display the obtained patterns in a tableau.

The images look better, but are still not very realistic. Can you explain what is happening? Does the method work well for all layers?

So, once again, which of the two approaches is better for studying the learned representation? Why?

Part 3: Targeted Adversarial Attack (3p)

Your task is to implement a targeted iterative adversarial attack.

Choose a clean image which is correctly classified by the net (e.g. the image of the labrador retriever).
Choose a target class different from the true class (e.g. 892: wall clock) and fix an ε > 0. Implement a projected gradient ascent that aims to maximize the softmax output of the target class w.r.t. the input image, but constrains the search to the ε-ball of the $\ell_\infty$ norm around the clean image.
- Start the optimization from the clean image.
- You may use the Adam optimizer to perform the gradient step. For this, the input image must require gradients.
- To enforce the constraint, you may e.g. use the following code after each gradient step
```
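# perturbation relative to the clean image
dx = (x.detach() - x0)
# its l_inf norm (largest absolute component)
dn = dx.flatten().norm(p=float('inf'))
# scale factor: 1 inside the eps-ball, dn/eps outside of it
div = torch.clamp(dn/eps, min=1.0)
# shrink the perturbation back into the ball and update the image
dx = dx / div
x.data = x0 + dx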
```
  where $x_0$ is the clean image (tensor) and $x$ is the current image (tensor).
Run the projected gradient ascent for a fixed number of steps.
Report the ε which admits a successful attack, show the obtained adversarial example along with the clean image and report the prediction probabilities for them.
Show the network's attention map for both the original and attacked image. Does the attack have an effect on this map?
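
The attack loop described above can be sketched as follows. This is a sketch under stated assumptions, not a reference solution: `model` is a toy linear classifier standing in for the network, the function name `targeted_attack`, the step count and the learning rate are arbitrary, and the loop maximises the logarithm of the target softmax output, which has the same maximiser as the softmax output itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy stand-in classifier with 5 classes (NOT VGG11)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5)).eval()
for p in model.parameters():
    p.requires_grad = False

def targeted_attack(model, x0, target, eps, steps=50, lr=0.01):
    x = x0.clone().requires_grad_(True)            # start from the clean image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = torch.log_softmax(model(x), dim=1)[0, target]
        (-log_p).backward()                        # Adam minimises, so negate
        opt.step()
        with torch.no_grad():                      # rescale back into the eps-ball
            dx = x - x0
            dn = dx.flatten().norm(p=float('inf'))
            x.copy_(x0 + dx / torch.clamp(dn / eps, min=1.0))
    return x.detach()

x0 = torch.rand(1, 3, 8, 8)
target, eps = 3, 0.1
x_adv = targeted_attack(model, x0, target, eps)
p_clean = torch.softmax(model(x0), dim=1)[0, target].item()
p_adv = torch.softmax(model(x_adv), dim=1)[0, target].item()
```

Comparing `p_clean` and `p_adv` shows how far the target probability moved within the allowed perturbation; for the real network, the attack counts as successful once the target class becomes the top prediction.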
