In this assignment, you will implement a Vision Transformer (ViT) from scratch for satellite image classification using the EuroSAT dataset.
You'll build and understand the core components of transformer architecture applied to computer vision tasks.
The assignment is based on the files given in
hw04.zip
Do not change anything else in the code, especially the default values in the class constructors; doing so can lead to failure during the evaluation process.
Dataset
The EuroSAT dataset consists of satellite images for land use and land cover classification:
Image Size: 64×64 pixels
Channels: 3 (RGB)
Classes: 10 land use categories: 'Annual Crop', 'Forest', 'Herbaceous Vegetation', 'Highway', 'Industrial', 'Pasture', 'Permanent Crop', 'Residential', 'River', 'Sea/Lake'
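For reference, the dataset can be loaded with torchvision's built-in EuroSAT class. This is only a sketch; the transform, root directory, and batch size are assumptions, not part of the assignment:

```python
import torch
from torchvision import datasets, transforms

# Convert the 64x64 RGB images to tensors of shape (3, 64, 64)
transform = transforms.ToTensor()

# "data" as the root directory and download=True are assumptions for this sketch
dataset = datasets.EuroSAT(root="data", transform=transform, download=True)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 3, 64, 64]); labels are class indices 0-9
```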
Scoring
The assignment is divided into 9 tasks, one for each of the classes that you will implement in the ViT model. Each task is worth 1 point, except the last, which is worth 2 points.
The tasks are as follows:
Task 1 - Patch Embeddings (1 point)
Implement the patch embeddings in the ViT model. The input image is divided into patches of a given size. Each patch is flattened into a vector and passed through a linear layer to generate the patch embeddings. The output of this layer is a sequence of patch embeddings.
The output shape of the patch embeddings should be (batch_size, num_patches, embed_dim).
The patch embeddings are then passed through the positional embeddings layer.
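To illustrate the flatten-and-project idea, here is a minimal sketch. The class name, arguments, and default values are assumptions; follow the signatures given in hw04.zip rather than this code:

```python
import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    """Flatten image patches and project them to embed_dim (illustrative only)."""

    def __init__(self, img_size=64, patch_size=8, in_channels=3, embed_dim=256):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # each flattened patch has in_channels * patch_size**2 values
        self.proj = nn.Linear(in_channels * patch_size ** 2, embed_dim)

    def forward(self, x):                                  # x: (B, C, H, W)
        p = self.patch_size
        patches = x.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5)        # (B, H/p, W/p, C, p, p)
        patches = patches.flatten(1, 2).flatten(2)         # (B, num_patches, C*p*p)
        return self.proj(patches)                          # (B, num_patches, embed_dim)
```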
Task 2 - Positional Embeddings (1 point)
Implement the positional embeddings in the ViT model. The positional embeddings are added to the patch embeddings to encode the spatial information of the patches. Furthermore, you will implement the class token, a learnable parameter that is prepended to the patch sequence and aggregates image-level information for classification.
The positional embeddings should have shape (1, num_patches + 1, embed_dim); the leading dimension of 1 lets them broadcast over the batch when added to the patch embeddings.
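A minimal sketch of the class token plus positional embeddings, again with assumed names and default values:

```python
import torch
import torch.nn as nn

class PositionalEmbeddings(nn.Module):
    """Prepend a [CLS] token and add learnable positions (illustrative only)."""

    def __init__(self, num_patches=64, embed_dim=256):
        super().__init__()
        # learnable class token, prepended to every patch sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # learnable positional embeddings of shape (1, num_patches + 1, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                # x: (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, embed_dim)
        x = torch.cat([cls, x], dim=1)                   # (B, num_patches + 1, embed_dim)
        return x + self.pos_embed                        # broadcasts over the batch dim
```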
Task 3 - Transformer Head (1 point)
Task 4 - Multi-Head Attention (1 point)
Task 5 - Feed-Forward Network (1 point)
Task 6 - Transformer Block (1 point)
Task 7 - Transformer Encoder (1 point)
Task 8 - Vision Transformer (1 point)
Implement the Vision Transformer model that consists of the patch embeddings, positional embeddings, and transformer encoder.
The output of the model should be the logits for each class.
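To show how the pieces fit together, here is a sketch that reuses the Task 1 and Task 2 sketches above, with PyTorch's built-in nn.TransformerEncoder standing in for the encoder you implement in Tasks 3-7. All names and defaults are assumptions, not the assignment's actual signatures:

```python
import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """End-to-end composition sketch; assumes the PatchEmbeddings and
    PositionalEmbeddings sketches above are in scope."""

    def __init__(self, num_classes=10, embed_dim=256, depth=6, num_heads=8):
        super().__init__()
        self.patch_embed = PatchEmbeddings(embed_dim=embed_dim)      # Task 1
        self.pos_embed = PositionalEmbeddings(embed_dim=embed_dim)   # Task 2
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # Tasks 3-7
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):               # x: (B, 3, 64, 64)
        x = self.patch_embed(x)         # (B, num_patches, embed_dim)
        x = self.pos_embed(x)           # (B, num_patches + 1, embed_dim)
        x = self.encoder(x)
        return self.head(x[:, 0])       # logits from the [CLS] token: (B, 10)
```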
Task 9 - Accuracy Evaluation (2 points)
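For local checking, a minimal accuracy computation might look like the sketch below; the actual grading is done in BRUTE, so this is only a way to estimate your score before submitting:

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified samples over a DataLoader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total
```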
Submission and Evaluation
Submit a .zip file containing your model.py with all of the implemented classes. The archive must also include a weights.pth file, which will be loaded in BRUTE to evaluate your accuracy.
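One plausible way to produce weights.pth after training is to save the model's state dict; this is a sketch, so check the provided code for the exact format BRUTE expects:

```python
import torch

model = VisionTransformer()                 # the class from your model.py
# ... training loop ...
torch.save(model.state_dict(), "weights.pth")

# Sanity-check that the saved file loads back cleanly before submitting:
model = VisionTransformer()
model.load_state_dict(torch.load("weights.pth", map_location="cpu"))
model.eval()
```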
Rules
Tips & Helpful links
PyTorch transforms - check functional transforms for how to apply the same randomized transform (RandomCrop, RandomRotation, …) to the satellite images.
ViT - the paper that introduced the ViT model.
FEL VPN - set up the FEL VPN to be able to connect to the GPU directly even from home
Good luck!