# Assignment 3+4: EXON detection

• 20 points
• Deadline: Wednesday, April 28th, 2021, 23:59
• Late submission penalty: -1 point per day but no more than 15 points
• Submit to BRUTE.
• Work individually or in groups of two. If you work in a group, email me in advance.
• Submit
  • a bash script compile.sh that compiles your code
  • a wrapper bash script exons.sh that runs your code
  • your source code, which must compile either
    • on the Ubuntu machines in the lab, or
    • on a freshly installed Ubuntu machine (in that case, provide a file packages.txt listing the packages that need to be installed)
  • a short report as explained below
• An example of submission may be found here.
• This homework option is new this year. If you believe that something should be changed, please email petr.rysavy@fel.cvut.cz.
This homework is meant as a substitute for homework 3 and 4. If you think that the problem is too open or difficult, please check those two.

As you know, each gene can be split into several exons and introns. During post-transcriptional modification, introns are removed from the RNA transcript, and the final mature RNA remains. In many computational tasks, it is desirable to detect exon and intron boundaries. One example is the alignment of RNA-Seq data. Your task in this homework is to write a program that is capable of exon detection. We will simplify the problem: our goal is only to detect exon starts for Homo sapiens. We will not try to generalize to multiple species or to detect exon ends.

Start by visiting the Ensembl database, where you can download human chromosome sequences in the FASTA format. Next, download the annotations in the GTF format. The GTF files contain far more information than you need; only the lines describing exons are important for this task. Each line contains several tab-separated values. The first value is the chromosome. The third is the feature type; exon is the type we will work with. The start and end positions of the exon follow in the fourth and fifth columns.

16	ensembl_havana	exon	164482	164686	.	+	.	gene_id "ENSG00000206178"; gene_version "2"; transcript_id "ENST00000354915"; transcript_version "3"; exon_number "2"; gene_name "HBZP1"; gene_source "ensembl_havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "HBZP1-201"; transcript_source "ensembl_havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003673193"; exon_version "1"; tag "basic"; transcript_support_level "NA";
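As a minimal sketch of how such lines could be filtered, the Python snippet below keeps only exon records and pulls out the fields described above. The names `Exon`, `parse_exons`, and `exon_start_position` are hypothetical helpers made up for this illustration, and the way minus-strand exons are handled is just one possible convention, not a requirement of the assignment. GTF coordinates are 1-based and inclusive, with start never larger than end.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Exon(NamedTuple):
    chromosome: str
    start: int    # 1-based, inclusive (GTF convention)
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"
    gene_id: str


# The attribute column holds key "value"; pairs, so a regex is enough for gene_id.
_GENE_ID = re.compile(r'gene_id "([^"]+)"')


def parse_exons(lines: Iterable[str]) -> Iterator[Exon]:
    """Yield one Exon per GTF line whose feature type is 'exon'."""
    for line in lines:
        if line.startswith("#"):          # header / comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = _GENE_ID.search(fields[8])
        if match is None:
            continue
        yield Exon(fields[0], int(fields[3]), int(fields[4]),
                   fields[6], match.group(1))


def exon_start_position(exon: Exon) -> int:
    """Assumed convention: on the minus strand the exon begins at `end`."""
    return exon.start if exon.strand == "+" else exon.end


if __name__ == "__main__":
    # Shortened version of the example line above (attributes truncated).
    sample = ('16\tensembl_havana\texon\t164482\t164686\t.\t+\t.\t'
              'gene_id "ENSG00000206178"; exon_number "2";')
    for exon in parse_exons([sample]):
        print(exon, exon_start_position(exon))
```

Keeping the `gene_id` with every exon is useful later, because it allows the data to be split by gene.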

Implement a neural network that searches for the exon starts.

• Split the data into train and test data and train your neural network. Always split “by gene”: never use the same gene for both learning and testing.
• You might want to use a convolutional neural network with one-hot encoding.
• The network might use a window for a look-up of the exon start. The exon start might be located in the middle of the window, on the right end of the window, on the left end of the window, or on any other fixed point. Think about the biological nature of the problem and decide what the respective position of the window and expected exon start should be.
• Think about a reasonable window size and balance it with runtime.
• If you have problems with the implementation, write to your tutor; he might try to help you …
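The data-preparation steps from the list above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names (`split_by_gene`, `window_at`, `one_hot`), the 20 % test fraction, the padding with `N`, and the all-zero encoding of unknown bases are choices made for this example. The `left` and `right` window parameters are deliberately left open, since choosing them is part of the task.

```python
import random
from typing import List, Sequence, Set, Tuple

ALPHABET = "ACGT"


def split_by_gene(gene_ids: Sequence[str], test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Return (train_genes, test_genes); no gene appears in both sets."""
    unique = sorted(set(gene_ids))            # sorted for reproducibility
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def window_at(chromosome: str, position: int, left: int, right: int) -> str:
    """Cut `left` bases before the 1-based `position` and `right` bases
    starting at it; pad with N where the window leaves the chromosome."""
    centre = position - 1                      # convert to 0-based index
    begin, end = centre - left, centre + right
    pad_left = "N" * max(0, -begin)
    pad_right = "N" * max(0, end - len(chromosome))
    body = chromosome[max(0, begin):min(len(chromosome), end)]
    return pad_left + body + pad_right


def one_hot(sequence: str) -> List[List[int]]:
    """Encode each base as a 4-element row; unknown bases become all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    train, test = split_by_gene(["g1", "g1", "g2", "g3", "g4", "g5"])
    print(sorted(train), sorted(test))
    window = window_at("ACGTACGT", position=1, left=2, right=2)
    print(window, one_hot(window))
```

Each training example would then be assigned to the train or test set according to the gene it comes from, rather than being shuffled individually. The resulting list of rows can be converted to a tensor for whichever neural network library you choose.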

Finally, write a short report (no longer than two pages!). Explain how your implementation works, what you implemented, and what you did not. Explain where I can find the various components of your code. Of course, your implementation does not need to be perfect. Write anything in the report that you (or I) might be curious about. What was the accuracy of your classifier? Does it generalize to different species? How long did it take to train the model? What obstacles did you encounter during your implementation?

## External libraries

You are allowed to use any external libraries of your choice except for high-level libraries that are designed to use NNs for genomic sequence classification. If you are not sure whether it is OK to use some library, write to me in advance.