In this assignment, you will implement a Vision Transformer (ViT) from scratch for satellite image classification using the EuroSAT dataset. You'll build and understand the core components of transformer architecture applied to computer vision tasks.
The EuroSAT dataset consists of satellite images for land use and land cover classification:
Image size: 64×64 pixels
Channels: 3 (RGB)
Classes: 10 land use categories: 'Annual Crop', 'Forest', 'Herbaceous Vegetation', 'Highway', 'Industrial', 'Pasture', 'Permanent Crop', 'Residential', 'River', 'Sea/Lake'
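A quick sanity check on the patch arithmetic these image dimensions imply (the patch size of 8 is an assumption for illustration; use whatever the assignment or lab prescribes):

```python
# Hypothetical patch size; the assignment may fix a different value.
img_size, patch_size, channels = 64, 8, 3

num_patches = (img_size // patch_size) ** 2     # (64 / 8)^2 = 64 tokens per image
patch_dim = channels * patch_size * patch_size  # 3 * 8 * 8 = 192 values per patch

print(num_patches, patch_dim)  # 64 192
```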
The assignment is divided into 9 tasks, one for each of the classes you will implement in the ViT model. Each task is worth 1 point, except the last, which is worth 2 points. The tasks are as follows:
Task 1 - Patch Embeddings (1 point)
Task 2 - Positional Embeddings (1 point)
Task 3 - Transformer Head (1 point)
Task 4 - Multi-Head Attention (1 point)
Task 5 - Feed-Forward Network (1 point)
Task 6 - Transformer Block (1 point)
Task 7 - Transformer Encoder (1 point)
Task 8 - Vision Transformer (1 point)
Task 9 - Accuracy Evaluation (2 points)
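As a warm-up for Task 1, here is a minimal NumPy sketch of patch embedding: split each image into non-overlapping patches, flatten them, and project to the embedding dimension. This is only an illustration of the idea, not the required implementation (the assignment presumably expects a PyTorch module, and the patch size 8 and embedding dimension 128 are assumed, not given):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed hyperparameters for illustration only.
img_size, patch_size, channels, embed_dim = 64, 8, 3, 128
num_patches = (img_size // patch_size) ** 2          # 64 patches
patch_dim = channels * patch_size * patch_size       # 192 values per patch

def patchify(images):
    """Split (B, C, H, W) images into flattened patches of shape (B, N, patch_dim)."""
    b, c, h, w = images.shape
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)                # (B, h/p, w/p, C, p, p)
    return x.reshape(b, num_patches, patch_dim)

# Random stand-in for the learned projection weights.
W = rng.normal(size=(patch_dim, embed_dim)) * 0.02

images = rng.normal(size=(2, channels, img_size, img_size))
tokens = patchify(images) @ W
print(tokens.shape)  # (2, 64, 128)
```

In PyTorch the same operation is commonly implemented as a single strided convolution (a Conv2d with kernel size and stride equal to the patch size), which is usually the cleaner choice inside an nn.Module.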
Submit a .zip file containing the model.py file with all the classes you have implemented, together with a weights.pth file, which will be loaded in BRUTE to evaluate your accuracy.
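One way to produce the submission archive, sketched with Python's standard zipfile module (the placeholder file contents below are dummies; your real model.py and weights.pth come from your implementation and training run):

```python
import pathlib
import zipfile

# Dummy placeholders so the example is self-contained; use your real files.
pathlib.Path("model.py").write_text("# ViT classes go here\n")
pathlib.Path("weights.pth").write_bytes(b"")  # real file comes from torch.save(...)

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("model.py")
    zf.write("weights.pth")

print(zipfile.ZipFile("submission.zip").namelist())  # ['model.py', 'weights.pth']
```

Zipping the two files by hand from your file manager or with the `zip` command works just as well.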
Good luck! If you get stuck, first consult the materials from the LAB 10 implementation; if that doesn't help, feel free to contact capekda4@fel.cvut.cz.