HW 4 - Vision Transformer

In this assignment, you will implement a Vision Transformer (ViT) from scratch for satellite image classification using the EuroSAT dataset. You'll build and understand the core components of the transformer architecture applied to computer vision tasks.

The assignment is based on the files provided in hw04.zip.
Do not change anything else in the code, especially the default values in the class __init__ methods; doing so can cause the evaluation to fail.

Dataset

The EuroSAT dataset consists of satellite images for land use and land cover classification:

Image Size: 64×64 pixels

Channels: 3 (RGB)

Classes: 10 land use categories: 'Annual Crop', 'Forest', 'Herbaceous Vegetation', 'Highway', 'Industrial', 'Pasture', 'Permanent Crop', 'Residential', 'River', 'Sea/Lake'

Scoring

The assignment is divided into 9 tasks, one for each of the classes you will implement in the ViT model. Each task is worth 1 point, except the last, which is worth 2 points. The tasks are as follows:

Task 1 - Patch Embeddings (1 point)

  • Implement the patch embeddings in the ViT model. The input image is divided into patches of a given size; each patch is flattened into a vector and passed through a linear layer to produce the patch embeddings. The output of this layer is a sequence of patch embeddings (see the sketch after this list).
  • The output shape of the patch embeddings should be (batch_size, num_patches, embed_dim).
  • The patch embeddings are then passed through the positional embeddings layer.
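
A minimal sketch of one possible implementation. The class name and default values here are illustrative, not the ones fixed in hw04.zip (keep the provided defaults in your submission). A strided convolution with kernel == stride == patch_size is equivalent to flattening each patch and applying a shared linear layer:

  import torch
  import torch.nn as nn

  class PatchEmbedding(nn.Module):
      """Splits the image into patches and linearly projects each one."""
      def __init__(self, img_size=64, patch_size=8, in_channels=3, embed_dim=128):
          super().__init__()  # illustrative defaults, not the assignment's
          self.num_patches = (img_size // patch_size) ** 2
          # kernel == stride == patch_size: one projection per patch
          self.proj = nn.Conv2d(in_channels, embed_dim,
                                kernel_size=patch_size, stride=patch_size)

      def forward(self, x):                    # x: (B, 3, 64, 64)
          x = self.proj(x)                     # (B, embed_dim, 8, 8)
          return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

For 64×64 inputs and patch size 8 this yields 64 patches, so torch.randn(2, 3, 64, 64) maps to a (2, 64, 128) tensor.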

Task 2 - Positional Embeddings (1 point)

  • Implement the positional embeddings in the ViT model. The positional embeddings are added to the patch embeddings to encode the spatial position of each patch. You will also implement the class token, a learnable parameter whose embedding aggregates information from the whole image (see the sketch after this list).
  • The output shape of the positional embeddings should be (1, num_patches + 1, embed_dim).
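
One way to sketch this (names and sizes are again illustrative): prepend a learnable class token to the patch sequence, then add a learned position embedding for every token:

  import torch
  import torch.nn as nn

  class PositionalEmbedding(nn.Module):
      """Prepends a learnable class token and adds learned position embeddings."""
      def __init__(self, num_patches=64, embed_dim=128):
          super().__init__()
          self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
          self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

      def forward(self, x):                                # x: (B, N, D)
          cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
          x = torch.cat([cls, x], dim=1)                   # (B, N + 1, D)
          return x + self.pos_embed                        # broadcasts over batch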

Task 3 - Transformer Head (1 point)

  • Implement the attention mechanism in the transformer head.
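
A minimal sketch of a single head of scaled dot-product self-attention (class name and dimensions are assumptions, not the skeleton's):

  import torch
  import torch.nn as nn

  class Head(nn.Module):
      """One head of scaled dot-product self-attention."""
      def __init__(self, embed_dim=128, head_dim=32):
          super().__init__()
          self.q = nn.Linear(embed_dim, head_dim, bias=False)
          self.k = nn.Linear(embed_dim, head_dim, bias=False)
          self.v = nn.Linear(embed_dim, head_dim, bias=False)
          self.scale = head_dim ** -0.5          # 1 / sqrt(d_k)

      def forward(self, x):                              # x: (B, T, D)
          q, k, v = self.q(x), self.k(x), self.v(x)      # each (B, T, head_dim)
          attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, T, T)
          attn = attn.softmax(dim=-1)                    # attention weights
          return attn @ v                                # (B, T, head_dim)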

Task 4 - Multi-Head Attention (1 point)

  • Implement the multi-head attention mechanism in the transformer head.
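
One possible sketch, reusing the Head class from Task 3 above: run the heads in parallel, concatenate their outputs, and mix them with a final linear projection:

  import torch
  import torch.nn as nn

  class MultiHeadAttention(nn.Module):
      """num_heads independent heads, concatenated and projected to embed_dim."""
      def __init__(self, embed_dim=128, num_heads=4):
          super().__init__()
          head_dim = embed_dim // num_heads  # assumes embed_dim % num_heads == 0
          self.heads = nn.ModuleList(
              [Head(embed_dim, head_dim) for _ in range(num_heads)])
          self.proj = nn.Linear(embed_dim, embed_dim)

      def forward(self, x):                                    # x: (B, T, D)
          out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, D)
          return self.proj(out)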

Task 5 - Feed-Forward Network (1 point)

  • Implement the feed-forward network in the transformer block.
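
A minimal sketch, assuming the usual expand-then-project MLP applied independently to every token (hidden size and dropout rate are illustrative):

  import torch.nn as nn

  class FeedForward(nn.Module):
      """Position-wise MLP applied independently to every token."""
      def __init__(self, embed_dim=128, hidden_dim=256, dropout=0.1):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(embed_dim, hidden_dim),  # expand
              nn.GELU(),
              nn.Linear(hidden_dim, embed_dim),  # project back
              nn.Dropout(dropout),
          )

      def forward(self, x):
          return self.net(x)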

Task 6 - Transformer Block (1 point)

  • Implement the transformer block that consists of multi-head attention and feed-forward network.
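
A sketch using the pre-norm layout from the ViT paper, built from the MultiHeadAttention and FeedForward sketches above (your skeleton may arrange the pieces differently):

  import torch.nn as nn

  class TransformerBlock(nn.Module):
      """Pre-norm block: x + MHA(LN(x)), then x + FFN(LN(x))."""
      def __init__(self, embed_dim=128, num_heads=4, hidden_dim=256):
          super().__init__()
          self.norm1 = nn.LayerNorm(embed_dim)
          self.attn = MultiHeadAttention(embed_dim, num_heads)
          self.norm2 = nn.LayerNorm(embed_dim)
          self.ffn = FeedForward(embed_dim, hidden_dim)

      def forward(self, x):
          x = x + self.attn(self.norm1(x))  # residual around attention
          x = x + self.ffn(self.norm2(x))   # residual around the MLP
          return x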

Task 7 - Transformer Encoder (1 point)

  • Implement the transformer encoder that consists of multiple transformer blocks.
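
One way to sketch the stack, reusing the TransformerBlock above (depth is an illustrative default):

  import torch.nn as nn

  class TransformerEncoder(nn.Module):
      """A stack of `depth` identical transformer blocks."""
      def __init__(self, depth=6, embed_dim=128, num_heads=4, hidden_dim=256):
          super().__init__()
          self.blocks = nn.Sequential(
              *[TransformerBlock(embed_dim, num_heads, hidden_dim)
                for _ in range(depth)])

      def forward(self, x):
          return self.blocks(x)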

Task 8 - Vision Transformer (1 point)

  • Implement the Vision Transformer model that combines the patch embeddings, positional embeddings, and transformer encoder (see the sketch after this list).
  • The output of the model should be the logits for each class.
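
A sketch tying the previous classes together; the class-token embedding is normalized and passed to a linear classifier (again, names and defaults are assumptions, not the ones fixed in hw04.zip):

  import torch.nn as nn

  class VisionTransformer(nn.Module):
      """Patchify -> add class token & positions -> encode -> classify."""
      def __init__(self, img_size=64, patch_size=8, in_channels=3,
                   embed_dim=128, depth=6, num_heads=4,
                   hidden_dim=256, num_classes=10):
          super().__init__()
          self.patch_embed = PatchEmbedding(img_size, patch_size,
                                            in_channels, embed_dim)
          self.pos_embed = PositionalEmbedding(self.patch_embed.num_patches,
                                               embed_dim)
          self.encoder = TransformerEncoder(depth, embed_dim,
                                            num_heads, hidden_dim)
          self.norm = nn.LayerNorm(embed_dim)
          self.head = nn.Linear(embed_dim, num_classes)

      def forward(self, x):                 # x: (B, 3, 64, 64)
          x = self.pos_embed(self.patch_embed(x))
          x = self.norm(self.encoder(x))
          return self.head(x[:, 0])         # logits from the class token, (B, 10)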

Task 9 - Accuracy Evaluation (2 points)

  • You should train the Vision Transformer model on the EuroSAT dataset and evaluate the accuracy of the model on the test set. The accuracy should be greater than 70%.
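
A hedged sketch of the evaluation step (the function name, device handling, and data loader are assumptions about your own training script, not part of the provided skeleton):

  import torch

  @torch.no_grad()
  def accuracy(model, loader, device="cuda"):
      """Fraction of correctly classified images in `loader`."""
      model.eval()
      correct = total = 0
      for images, labels in loader:
          images, labels = images.to(device), labels.to(device)
          preds = model(images).argmax(dim=1)   # class with the highest logit
          correct += (preds == labels).sum().item()
          total += labels.numel()
      return correct / total

  # After training, save the weights for submission:
  # torch.save(model.state_dict(), "weights.pth")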

Submission and Evaluation

Submit a .zip file containing the model.py file with all the implemented classes, together with a weights.pth file, which will be loaded in BRUTE to evaluate your accuracy.

Rules

  • Use only the dataset that is given to you; if your model overfits, focus on better data augmentations and regularization techniques such as Dropout rather than scouring the internet for more data.
  • PyTorch transforms - check the functional transforms for how to apply the same randomized transform (RandomCrop, RandomRotation, …) to the satellite images (see the sketch after this list).
  • ViT - the paper that introduced the ViT model.
  • FEL VPN - set up the FEL VPN to be able to connect to the GPU directly, even from home.
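
For the functional-transforms hint above, a minimal sketch (the pairing use case and function name are illustrative): draw the random parameters once with get_params, then apply the identical transform deterministically:

  import torchvision.transforms as T
  import torchvision.transforms.functional as TF

  def paired_random_crop(img_a, img_b, size=(56, 56)):
      """Draw crop parameters once and apply the same crop to both images."""
      i, j, h, w = T.RandomCrop.get_params(img_a, output_size=size)
      return TF.crop(img_a, i, j, h, w), TF.crop(img_b, i, j, h, w)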

Good luck! If you get stuck, first consult the materials from the LAB 10 implementation; if that doesn't help, feel free to contact capekda4@fel.cvut.cz.
