===== HW 3 - Segmentation =====

In this assignment, your task will be to train a neural network with a multi-loss objective, namely hierarchical classification and semantic segmentation. The dataset consists of images of pets, where each image corresponds to a species (''cat or dog'') and a breed (''25 dog breeds'' and ''12 cat breeds''). For each image there is also a semantic segmentation map with three classes: ''foreground, background & boundary''. The task is to train a model that can determine the species ''p(species|image)'', the breed ''p(breed|image)'' and the segmentation mask ''p(mask|image)''.

{{ :courses:b3b33urob:tutorials:hw03.zip | hw03.zip}}

**UPDATE: make sure that in your image and mask transforms you use ''transforms.Resize(128)'' and not ''transforms.Resize%%((128,128))%%'' as was originally in the homework template!!**

==== Scoring ====

**Task 1 - Species classification (1 point)**
  * accuracy: >85%
  * classes are: ''dog'' or ''cat''

**Task 2 - Breed classification (3 points)**
  * top-3 accuracy: >70%
  * hint: there is a function in PyTorch that gives you the top-k highest values and their indices (it also appears in the sketch below the Submission section)
  * dog breeds: '' 'american_bulldog', 'american_pit_bull_terrier', 'basset_hound', 'beagle', 'boxer', 'chihuahua', 'english_cocker_spaniel', 'english_setter', 'german_shorthaired', 'great_pyrenees', 'havanese', 'japanese_chin', 'keeshond', 'leonberger', 'miniature_pinscher', 'newfoundland', 'pomeranian', 'pug', 'saint_bernard', 'samoyed', 'scottish_terrier', 'shiba_inu', 'staffordshire_bull_terrier', 'wheaten_terrier', 'yorkshire_terrier' ''
  * cat breeds: '' 'Abyssinian', 'Bengal', 'Birman', 'Bombay', 'British_Shorthair', 'Egyptian_Mau', 'Maine_Coon', 'Persian', 'Ragdoll', 'Russian_Blue', 'Siamese', 'Sphynx' ''

**Task 3 - Semantic segmentation (6 points)**
  * mean IoU: >0.5 (3 points)
  * min IoU: >0.25 (3 points)
  * three classes: 0 = foreground, 1 = background, 2 = boundary

==== Submission and Evaluation ====

Submit a .zip file containing all your training & inference code. There needs to be a ''model.py'' file containing a ''Net'' class which has a method ''predict''. There also needs to be a ''weights.pth'' file, which will be loaded in BRUTE with:

  model.load_state_dict(torch.load(weights_path, map_location=torch.device('cpu')))

You can save the model with:

  torch.save(model.state_dict(), "weights.pth")

The ''predict'' method takes a single ''3 x 128 x 128'' image as input (processed with the same transform as in the template: ''Resize'', ''CenterCrop'', ''ToTensor'', ''Normalize''). After computing the predictions, it outputs them in the following format:
  * species prediction: a **string**, either 'cat' or 'dog'
  * breed prediction:
    * a **tuple of three strings**, representing the top-3 most likely breeds for the predicted species
    * the strings are case sensitive and match the naming convention of the dataset (dogs with a lowercase first letter and cats with an uppercase first letter, i.e. 'Bengal', not 'bengal')
    * **IMPORTANT:** the breed prediction must correspond to the predicted species, i.e. a Bengal cat is not a dog
  * segmentation mask: a ''128 x 128 tensor'' with values ''0, 1 or 2'', representing the foreground, background and boundary classes respectively
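For illustration, here is a minimal sketch of how the raw outputs for a single image could be turned into this return format inside your ''predict'' method. It is not a reference implementation: it assumes a hypothetical model whose forward pass returns species, breed and segmentation logits, that species index 0 means dog, and that the 25 dog breeds come before the 12 cat breeds in the breed output; adapt the index bookkeeping to your own label encoding.

<code python>
# NOT the reference solution, just a sketch of the required output format.
# Assumed shapes: species_logits 1x2, breed_logits 1x37, seg_logits 1x3x128x128.
import torch

DOG_BREEDS = ('american_bulldog', 'american_pit_bull_terrier', 'basset_hound',
              'beagle', 'boxer', 'chihuahua', 'english_cocker_spaniel',
              'english_setter', 'german_shorthaired', 'great_pyrenees',
              'havanese', 'japanese_chin', 'keeshond', 'leonberger',
              'miniature_pinscher', 'newfoundland', 'pomeranian', 'pug',
              'saint_bernard', 'samoyed', 'scottish_terrier', 'shiba_inu',
              'staffordshire_bull_terrier', 'wheaten_terrier', 'yorkshire_terrier')
CAT_BREEDS = ('Abyssinian', 'Bengal', 'Birman', 'Bombay', 'British_Shorthair',
              'Egyptian_Mau', 'Maine_Coon', 'Persian', 'Ragdoll',
              'Russian_Blue', 'Siamese', 'Sphynx')


def format_prediction(species_logits, breed_logits, seg_logits):
    """Turn raw logits for one image into (species, top-3 breeds, mask)."""
    species = 'dog' if species_logits.argmax(dim=1).item() == 0 else 'cat'

    # the top-3 breeds must belong to the predicted species,
    # so only look at that species' slice of the breed logits
    if species == 'dog':
        logits, names = breed_logits[0, :len(DOG_BREEDS)], DOG_BREEDS
    else:
        logits, names = breed_logits[0, len(DOG_BREEDS):], CAT_BREEDS
    top3 = torch.topk(logits, k=3).indices.tolist()
    breeds = tuple(names[i] for i in top3)   # tuple of three strings

    mask = seg_logits[0].argmax(dim=0)       # 128x128 tensor with values 0/1/2
    return species, breeds, mask

# inside Net.predict you would then do something like:
#   with torch.no_grad():
#       species_logits, breed_logits, seg_logits = self(image.unsqueeze(0))
#   return format_prediction(species_logits, breed_logits, seg_logits)
</code>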
==== Tournament ====

Accompanying the assignment, there will be a tournament in which the models are ranked based on their performance. There will be a separate ranking for each task: species accuracy, top-3 breed accuracy and mean IoU. The final ranking will be determined by the sum of the ranks in all three tasks.

The scoring will be based on the ranking as follows:
  * **Top 20** - 1 point
  * **Top 10** - 2 points
  * **Top 5** - 3 points
  * **Top 1** - 5 points

==== Rules ====

  * Use only the dataset that is given to you. If your model overfits, focus on better data augmentations and regularization techniques such as ''Dropout''; do not go around the internet trying to gather more data. The training code you upload must reach a result similar to yours on the provided data subset.
  * Do not directly download models / weights from the internet trained on this or similar datasets, as this would defeat the purpose of this homework, which is about searching for the best architecture & training recipe.

==== Tips & Helpful links ====

  * Add ''nn.BatchNorm2d'' and ''nn.BatchNorm1d'' into your conv, transposed conv and fully connected layers.
  * Always order the layers as Conv/TransposedConv/Linear -> BatchNorm2d/BatchNorm1d -> ReLU.
  * The convolutional part of your network should have enough layers so that the dimension of the ''torch.flatten'' output isn't too big and doesn't cause the MLP head to have too many parameters (e.g. ''128*16*16 = 32 768'' -> the matrix in the linear layer might have dimension ''32768x256''; if the output were ''128x4x4'', the matrix would be ''2048x256'', which is a smaller jump in dimensionality).
  * An example architecture of your network could be as follows (a sketch of it is given after this list):
    * **backbone** - downscales the input into a compressed representation
      * 5 x ''ConvBlock'', input: ''Bx3x128x128'', output: ''Bx128x4x4'' (each block is ''Conv2D(3x3,pad=1) -> BatchNorm2d -> ReLU -> MaxPool'')
    * **segmentation head** - takes the **backbone** output and feeds it into a network made of transposed convolution layers, which upscales it back to the image dimensions and assigns each pixel the probabilities that it belongs to a given class
      * 5 x ''TransposedConvBlock'', input: ''Bx128x4x4'', output: ''BxCx128x128'' (each block except the last is ''TransposedConv2D(2x2,stride=2) -> BatchNorm2d -> ReLU'', the last layer is only ''TransposedConv''; ''C=3'' in our case, as we are segmenting into 3 classes)
    * **species classifier** - takes the **backbone** output, feeds it into an MLP and outputs the probabilities that the image is a cat/dog
      * 2 x ''Fully Connected Layer'', input: ''Bx(128*4*4)=Bx2048'', output: ''Bx2'' (each layer except the last is ''Linear -> BatchNorm1d -> ReLU'', the last layer is only ''Linear'')
    * **breed classifier** - same as the **species classifier**, but outputs 37 probabilities (one for each breed) instead of 2
      * 2 x ''Fully Connected Layer'', input: ''Bx(128*4*4)=Bx2048'', output: ''Bx37'' (for the number of breeds, same structure as the **species classifier** otherwise)
    * the **loss** is then the sum of the individual losses: ''loss = segment_loss + species_loss + breed_loss''
  * Try adding more convolutions into a ConvBlock (e.g. ''Conv2d -> BatchNorm2d -> ReLU -> Conv2d -> BatchNorm2d -> ReLU -> MaxPool'').
  * ''nn.CrossEntropyLoss'' has an argument ''weight'', which takes a ''(#classes,)'' shaped tensor and weighs the loss for each example based on its ground-truth class (this helps greatly with class imbalance).
  * If you are encountering overfitting, experiment with adding ''nn.Dropout(prob)'' to your network after some / all blocks.
  * [[https://paperswithcode.com/method/focal-loss | FocalLoss ]] - a modification of CrossEntropyLoss that assigns lower weights to easy, i.e. high-confidence, examples (it can also help with class imbalance)
  * [[https://pytorch.org/vision/0.13/transforms.html#functional-transforms | PyTorch transforms]] - check the functional transforms for how to apply the same randomized transform (RandomCrop, RandomRotation, ...) to both the image and the segmentation mask (a short example is given after this list)
  * [[https://en.wikipedia.org/wiki/U-Net | UNET]] - state-of-the-art architecture for semantic segmentation
  * [[https://svti.fel.cvut.cz/en/services/vpn.html | FEL VPN]] - set up the FEL VPN to be able to connect to the GPU directly even from home
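To make the example architecture above concrete, here is a minimal sketch of it in PyTorch. The channel progression, the hidden size of 256 and the plain sum of three cross-entropy losses follow the bullets above, but none of these choices are prescribed; treat it as a starting point rather than a reference solution.

<code python>
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    # Conv -> BatchNorm -> ReLU -> MaxPool, halves the spatial resolution
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


def upconv_block(c_in, c_out, last=False):
    # TransposedConv(2x2, stride=2) doubles the spatial resolution;
    # the last block outputs raw per-pixel class logits (no BN/ReLU)
    layers = [nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)]
    if not last:
        layers += [nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # backbone: Bx3x128x128 -> Bx128x4x4
        self.backbone = nn.Sequential(
            conv_block(3, 16), conv_block(16, 32), conv_block(32, 64),
            conv_block(64, 128), conv_block(128, 128),
        )
        # segmentation head: Bx128x4x4 -> Bx3x128x128
        self.seg_head = nn.Sequential(
            upconv_block(128, 64), upconv_block(64, 32), upconv_block(32, 16),
            upconv_block(16, 8), upconv_block(8, 3, last=True),
        )

        # classification heads on the flattened backbone output (128*4*4 = 2048)
        def mlp(n_out):
            return nn.Sequential(
                nn.Linear(128 * 4 * 4, 256), nn.BatchNorm1d(256),
                nn.ReLU(inplace=True), nn.Linear(256, n_out),
            )

        self.species_head = mlp(2)    # cat / dog
        self.breed_head = mlp(37)     # 25 dog breeds + 12 cat breeds

    def forward(self, x):
        feats = self.backbone(x)
        flat = torch.flatten(feats, start_dim=1)
        return self.species_head(flat), self.breed_head(flat), self.seg_head(feats)


# one training step: the total loss is the sum of the three individual losses
model = Net()
criterion = nn.CrossEntropyLoss()
images = torch.randn(4, 3, 128, 128)          # dummy batch of 4 images
species_gt = torch.randint(0, 2, (4,))
breed_gt = torch.randint(0, 37, (4,))
mask_gt = torch.randint(0, 3, (4, 128, 128))  # per-pixel class ids
species_logits, breed_logits, seg_logits = model(images)
loss = (criterion(species_logits, species_gt)
        + criterion(breed_logits, breed_gt)
        + criterion(seg_logits, mask_gt))     # CrossEntropyLoss accepts BxCxHxW logits
loss.backward()
</code>

The functional-transforms link above is about keeping the image and its segmentation mask aligned under random augmentations: draw the random parameters once and apply the identical transform to both. A short sketch follows; the flip probability, rotation range and background fill value are arbitrary illustrations, not part of the assignment.

<code python>
import random

import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode


def augment_pair(image, mask):
    """Apply the SAME random flip and rotation to an image and its mask."""
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    angle = random.uniform(-15.0, 15.0)  # sample the parameter once...
    image = TF.rotate(image, angle, interpolation=InterpolationMode.BILINEAR)
    # ...and reuse it for the mask; NEAREST keeps the class ids 0/1/2 intact,
    # fill=1 marks the newly exposed corners as background
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST, fill=1)
    return image, mask
</code>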
Good luck! If you get stuck, feel free to consult the web or various chatbots, just make sure to acquire true understanding in the process and not just copy stuff. In the case of any questions or concerns, please contact .