The main goal of this lab is to understand how Vision Transformers (ViTs) work in practice and how their main components and design choices affect image classification performance.
Skills: image classification with PyTorch, patch embedding, positional encoding, self-attention, transformer blocks, hyperparameter tuning, attention visualization, pretrained Vision Transformers, transfer learning.
Insights: how ViTs process images as patches, how attention behaves, how architecture choices affect performance, and how pretrained transformers compare to custom models.
Jupyter Notebook: pytorch_vision_transformers.ipynb.zip
We recommend the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” for additional reading.
In this lab, students are introduced to Vision Transformers through both implementation and experimentation. The notebook combines conceptual questions with practical coding tasks, leading from dataset preparation to building a custom ViT, studying important hyperparameters, visualizing attention maps, and fine-tuning pretrained transformer models.
The purpose of the lab is not only to train a model, but also to understand the internal structure of Vision Transformers and compare them with more standard image classification pipelines.
Students set up the environment, load the CIFAR-10 dataset, inspect the classes, and visualize data distributions.
Students also visualize sample images from CIFAR-10.
This part introduces the main concepts behind Vision Transformers and includes short theoretical questions to support understanding.
Students progressively implement the main components of a ViT in PyTorch, including:
PatchEmbedding()
Embeddings()
AttentionHead()
MultiHeadAttention()
MLP()
Block()
Encoder()
VisionTransformer()
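As one illustration of these components, the patch embedding step is often implemented as a strided convolution whose kernel size and stride equal the patch size. A minimal sketch, not necessarily the notebook's exact code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to hidden_size."""

    def __init__(self, image_size=32, patch_size=4, num_channels=3, hidden_size=48):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A Conv2d with kernel_size == stride == patch_size extracts and
        # linearly projects each patch in a single operation.
        self.projection = nn.Conv2d(num_channels, hidden_size,
                                    kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.projection(x)    # (batch, hidden_size, H/P, W/P)
        x = x.flatten(2)          # (batch, hidden_size, num_patches)
        return x.transpose(1, 2)  # (batch, num_patches, hidden_size)
```

With the default configuration (32x32 CIFAR-10 images, patch_size=4, hidden_size=48), this produces 64 patch tokens of dimension 48 per image.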
Students complete the training pipeline and train their custom Vision Transformer for image classification.
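A generic PyTorch training step has the following shape. A tiny stand-in classifier replaces the ViT here so the sketch stays self-contained; the notebook trains the full VisionTransformer instead:

```python
import torch
import torch.nn as nn

# Stand-in model: flatten a 3x32x32 image and classify into 10 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    model.train()
    optimizer.zero_grad()
    logits = model(images)            # forward pass
    loss = criterion(logits, labels)  # classification loss
    loss.backward()                   # backpropagation
    optimizer.step()                  # parameter update
    return loss.item()

# One step on random data, just to show the mechanics:
loss = train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
```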
The default configuration for the custom ViT is:
config = {
    "patch_size": 4,                       # Size of each image patch (4x4 pixels)
    "hidden_size": 48,                     # Dimension of the token embeddings
    "num_hidden_layers": 4,                # Number of transformer encoder blocks
    "num_attention_heads": 4,              # Number of attention heads in multi-head self-attention
    "intermediate_size": 4 * 48,           # Hidden dimension of the MLP inside each transformer block
    "hidden_dropout_prob": 0.0,            # Dropout used after embeddings / MLP layers
    "attention_probs_dropout_prob": 0.0,   # Dropout used on attention weights
    "initializer_range": 0.02,             # Standard deviation used to initialize the weights
    "image_size": 32,                      # Input image size
    "num_classes": 10,                     # Number of output classes in CIFAR-10
    "num_channels": 3,                     # Number of input channels (RGB)
    "qkv_bias": True,                      # Adds bias terms to query, key, and value projections
    "use_faster_attention": True,          # Uses a more efficient attention implementation if available
}
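A few quantities implied by this configuration are worth checking before training (assuming the standard [CLS] token is prepended in Embeddings(), as in the original ViT):

```python
patch_size, hidden_size, num_heads, image_size = 4, 48, 4, 32

num_patches = (image_size // patch_size) ** 2  # 8 x 8 = 64 patch tokens
seq_len = num_patches + 1                      # +1 for the [CLS] token -> 65
head_dim = hidden_size // num_heads            # 48 / 4 = 12 dimensions per head

# hidden_size must be divisible by the number of attention heads
assert hidden_size % num_heads == 0
```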
Students study the effect of important design and optimization choices, such as:
Students are expected to create plots for these experiments and answer theoretical questions based on the observed results.
Students extract and visualize attention maps from the trained Vision Transformer and interpret what the model attends to.
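The key step in this visualization is mapping the [CLS] token's attention weights back onto the spatial grid of patches. A minimal sketch with synthetic attention scores standing in for the weights of a trained AttentionHead:

```python
import torch

num_patches, grid = 64, 8   # 32x32 image, 4x4 patches -> 8x8 grid
seq_len = num_patches + 1   # patch tokens plus the [CLS] token

# Synthetic attention scores standing in for Q @ K^T / sqrt(d)
scores = torch.randn(seq_len, seq_len)
attn = torch.softmax(scores, dim=-1)  # each row sums to 1

# Row 0 is the [CLS] token; drop its attention to itself and
# reshape the 64 patch weights into the 8x8 spatial grid.
cls_to_patches = attn[0, 1:]
attn_map = cls_to_patches.reshape(grid, grid)
# attn_map can now be upsampled to 32x32 and overlaid on the input image.
```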
Students apply augmentation methods and evaluate their impact on model performance.
Students fine-tune pre-trained Vision Transformer (ViT) models and compare different transfer learning strategies. Fine-tuning a pre-trained ViT involves taking a model that has already been trained on a large-scale dataset (e.g., ImageNet-1k) and adapting it to a new, smaller dataset (e.g., CIFAR-10).
The main benefit is that the model has already learned useful visual features, which can then be adapted to the new task. This reduces both the training time and the computational cost compared to training a model from scratch.
In this assignment, students will use the timm PyTorch library to load and fine-tune the following pretrained models:
ViT-T/16
ViT-B/16
Students write a short technical report summarizing the observations and conclusions from the experiments in this notebook. Discuss:
Students should complete the following in the notebook:
Students are expected to submit two files:
Submit the fully completed notebook, including:
Submit a PDF report containing the figures and plots generated in the notebook.
For this report:
The report should clearly present the experimental results and visualizations produced in the notebook.
In summary, the submission for this lab consists of the fully completed notebook and the PDF report.