HW 4 - Vision Transformer

In this assignment, you will implement a Vision Transformer (ViT) from scratch for satellite image classification using the EuroSAT dataset. You'll build and understand the core components of transformer architecture applied to computer vision tasks.

The assignment is based on the files provided in hw04.zip.
Do not change anything else in the code, especially the default values in each class's init; doing so can lead to failures during the evaluation process.

Dataset

The EuroSAT dataset consists of satellite images for land use and land cover classification:

Image Size: 64×64 pixels

Channels: 3 (RGB)

Classes: 10 land use categories: 'Annual Crop', 'Forest', 'Herbaceous Vegetation', 'Highway', 'Industrial', 'Pasture', 'Permanent Crop', 'Residential', 'River', 'Sea/Lake'
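Before starting, it helps to work out how these image dimensions translate into transformer input tokens. A quick sketch of the arithmetic, assuming a hypothetical patch size of 8 (the actual patch size is defined in hw04.zip, not here):

```python
# Token arithmetic for EuroSAT inputs; patch_size = 8 is an
# illustrative assumption, not the value required by the assignment.
image_size = 64      # EuroSAT images are 64x64
channels = 3         # RGB
patch_size = 8       # assumption for illustration only

# Non-overlapping patches per side and in total
patches_per_side = image_size // patch_size          # 8
num_patches = patches_per_side ** 2                  # 64

# Each flattened patch becomes one input token of this dimension
patch_dim = channels * patch_size * patch_size       # 192

# With a learnable [CLS] token prepended, the sequence length is
seq_len = num_patches + 1                            # 65

print(patches_per_side, num_patches, patch_dim, seq_len)  # 8 64 192 65
```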

Scoring

The assignment is divided into 9 tasks, one for each class you will implement in the ViT model. Each task is worth 1 point, except the last, which is worth 2 points. The tasks are as follows:

Task 1 - Patch Embeddings (1 point)

Task 2 - Positional Embeddings (1 point)

Task 3 - Transformer Head (1 point)

Task 4 - Multi-Head Attention (1 point)

Task 5 - Feed-Forward Network (1 point)

Task 6 - Transformer Block (1 point)

Task 7 - Transformer Encoder (1 point)

Task 8 - Vision Transformer (1 point)

Task 9 - Accuracy Evaluation (2 points)
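Tasks 1 and 2 turn an image into a sequence of embedded tokens that the later transformer blocks consume. A minimal sketch of a patch-embedding layer is shown below; the class name, patch size, and embedding dimension are illustrative assumptions, so follow the signatures given in hw04.zip rather than this sketch:

```python
import torch
import torch.nn as nn

class PatchEmbeddingSketch(nn.Module):
    """Illustrative patch embedding: a Conv2d whose stride equals its
    kernel size splits the image into non-overlapping patches and
    linearly projects each one to the embedding dimension."""

    def __init__(self, img_size=64, patch_size=8, in_channels=3, embed_dim=96):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 64, 64)
        x = self.proj(x)                     # (B, embed_dim, 8, 8)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim)
        return x

tokens = PatchEmbeddingSketch()(torch.randn(2, 3, 64, 64))
print(tokens.shape)  # torch.Size([2, 64, 96])
```

The remaining tasks add positional embeddings to this token sequence, pass it through stacked attention/feed-forward blocks, and classify from the resulting representation.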

Submission and Evaluation

Submit a .zip file containing model.py with all of the implemented classes, together with a weights.pth file, which will be loaded in BRUTE to evaluate your model's accuracy.
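One way to produce and sanity-check the weights.pth file is the standard state-dict workflow sketched below. A tiny nn.Linear stands in for your trained ViT here; in the actual submission you would save the state dict of the model class defined in model.py:

```python
import torch
import torch.nn as nn

# Stand-in for the trained model; in the assignment this would be the
# ViT defined in model.py (this module is illustrative only).
model = nn.Linear(4, 2)

# Save only the state dict, not the whole pickled module object
torch.save(model.state_dict(), "weights.pth")

# Sanity check: a freshly constructed instance with the same
# architecture and default hyperparameters must be able to load it,
# since that is what the evaluation server will do.
reloaded = nn.Linear(4, 2)
reloaded.load_state_dict(torch.load("weights.pth"))
assert all(torch.equal(p, q) for p, q in
           zip(model.state_dict().values(), reloaded.state_dict().values()))
```

This also explains why the default values in the class init must not change: the evaluation constructs the model with those defaults, and load_state_dict fails if the architecture no longer matches the saved tensors.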

Rules

Good luck! If you get stuck, first consult the materials from the Lab 10 implementation; if that doesn't help, feel free to contact capekda4@fel.cvut.cz.