The main goal of this lab is to understand how Vision Transformers (ViTs) work in practice and how their main components and design choices affect image classification performance.
Skills: image classification with PyTorch, patch embedding, positional encoding, self-attention, transformer blocks, hyperparameter tuning, attention visualization, pretrained Vision Transformers, transfer learning.
Insights: how ViTs process images as patches, how attention behaves, how architecture choices affect performance, and how pretrained transformers compare to custom models.
Jupyter Notebook: pytorch_vision_transformers.ipynb.zip
We recommend the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” for additional reading.
In this lab, students are introduced to Vision Transformers through both implementation and experimentation. The notebook combines conceptual questions with practical coding tasks, leading from dataset preparation to building a custom ViT, studying important hyperparameters, visualizing attention maps, and fine-tuning pretrained transformer models.
The purpose of the lab is not only to train a model, but also to understand the internal structure of Vision Transformers and compare them with more standard image classification pipelines.
Students set up the environment, load the CIFAR-10 dataset, inspect the classes, and visualize data distributions.
Students also visualize sample images from CIFAR-10.
This part introduces the main concepts behind Vision Transformers and includes short theoretical questions to support understanding.
Students progressively implement the main components of a ViT in PyTorch, including:
PatchEmbedding()
Embeddings()
AttentionHead()
MultiHeadAttention()
MLP()
Block()
Encoder()
VisionTransformer()
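As one illustration of these components, the patch embedding step is often implemented as a strided convolution whose kernel size and stride equal the patch size. A minimal sketch, not necessarily the notebook's exact code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to hidden_size."""

    def __init__(self, image_size=32, patch_size=4, num_channels=3, hidden_size=48):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A Conv2d with kernel_size == stride == patch_size extracts and
        # linearly projects each patch in a single operation.
        self.projection = nn.Conv2d(num_channels, hidden_size,
                                    kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.projection(x)    # (batch, hidden_size, H/P, W/P)
        x = x.flatten(2)          # (batch, hidden_size, num_patches)
        return x.transpose(1, 2)  # (batch, num_patches, hidden_size)
```

With the default configuration (32x32 CIFAR-10 images, patch_size=4, hidden_size=48), this produces 64 patch tokens of dimension 48 per image.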
Students complete the training pipeline and train their custom Vision Transformer for image classification.
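A generic PyTorch training step has the following shape. A tiny stand-in classifier replaces the ViT here so the sketch stays self-contained; the notebook trains the full VisionTransformer instead:

```python
import torch
import torch.nn as nn

# Stand-in model: flatten a 3x32x32 image and classify into 10 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    model.train()
    optimizer.zero_grad()
    logits = model(images)            # forward pass
    loss = criterion(logits, labels)  # classification loss
    loss.backward()                   # backpropagation
    optimizer.step()                  # parameter update
    return loss.item()

# One step on random data, just to show the mechanics:
loss = train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
```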
The default configuration for the custom ViT is:
config = {
    "patch_size": 4,                       # Size of each image patch (4x4 pixels)
    "hidden_size": 48,                     # Dimension of the token embeddings
    "num_hidden_layers": 4,                # Number of transformer encoder blocks
    "num_attention_heads": 4,              # Number of attention heads in multi-head self-attention
    "intermediate_size": 4 * 48,           # Hidden dimension of the MLP inside each transformer block
    "hidden_dropout_prob": 0.0,            # Dropout used after embeddings / MLP layers
    "attention_probs_dropout_prob": 0.0,   # Dropout used on attention weights
    "initializer_range": 0.02,             # Standard deviation used to initialize the weights
    "image_size": 32,                      # Input image size
    "num_classes": 10,                     # Number of output classes in CIFAR-10
    "num_channels": 3,                     # Number of input channels (RGB)
    "qkv_bias": True,                      # Adds bias terms to query, key, and value projections
    "use_faster_attention": True,          # Uses a more efficient attention implementation if available
}
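A few quantities implied by this configuration are worth checking before training (assuming the standard [CLS] token is prepended in Embeddings(), as in the original ViT):

```python
patch_size, hidden_size, num_heads, image_size = 4, 48, 4, 32

num_patches = (image_size // patch_size) ** 2  # 8 x 8 = 64 patch tokens
seq_len = num_patches + 1                      # +1 for the [CLS] token -> 65
head_dim = hidden_size // num_heads            # 48 / 4 = 12 dimensions per head

# hidden_size must be divisible by the number of attention heads
assert hidden_size % num_heads == 0
```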
Students study the effect of important design and optimization choices, such as:
Students are expected to create plots for these experiments and answer theoretical questions based on the observed results.
Students extract and visualize attention maps from the trained Vision Transformer and interpret what the model attends to.
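The key step in this visualization is mapping the [CLS] token's attention weights back onto the spatial grid of patches. A minimal sketch with synthetic attention scores standing in for the weights of a trained AttentionHead:

```python
import torch

num_patches, grid = 64, 8   # 32x32 image, 4x4 patches -> 8x8 grid
seq_len = num_patches + 1   # patch tokens plus the [CLS] token

# Synthetic attention scores standing in for Q @ K^T / sqrt(d)
scores = torch.randn(seq_len, seq_len)
attn = torch.softmax(scores, dim=-1)  # each row sums to 1

# Row 0 is the [CLS] token; drop its attention to itself and
# reshape the 64 patch weights into the 8x8 spatial grid.
cls_to_patches = attn[0, 1:]
attn_map = cls_to_patches.reshape(grid, grid)
# attn_map can now be upsampled to 32x32 and overlaid on the input image.
```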
Students apply augmentation methods and evaluate their impact on model performance.
Students fine-tune pre-trained Vision Transformer (ViT) models and compare different transfer learning strategies. Fine-tuning a pre-trained ViT involves taking a model that has already been trained on a large-scale dataset (e.g., ImageNet-1k) and adapting it to a new, smaller dataset (e.g., CIFAR-10).
The main benefit is that the model has already learned useful visual features, which can then be adapted to the new task. This reduces both the training time and the computational cost compared to training a model from scratch.
In this assignment, students will use the timm PyTorch library to load and fine-tune the following pretrained models:
ViT-T/16
ViT-B/16
Students write a short technical report summarizing the observations and conclusions from the experiments in this notebook. Discuss:
Students should complete the following in the notebook:
Students are expected to submit two files:
Submit the fully completed notebook, including:
Submit a PDF report containing the figures and plots generated in the notebook.
For this report:
The report should clearly present the experimental results and visualizations produced in the notebook.
In summary, the submission for this lab consists of the fully completed notebook and the PDF report.