Assignment 3+4: Exon detection

This homework is a substitute for homeworks 3 and 4. If you find this problem too open-ended or too difficult, have a look at those two instead.

As you know, each gene can be split into several exons and introns. During splicing, a post-transcriptional modification, introns are removed from the RNA transcript and only the final mature RNA remains. Many computational tasks require detecting exon and intron boundaries; one example is the alignment of RNA-Seq data. Your task in this homework is to write a program capable of exon detection. We will simplify the problem: our goal is only to detect exon starts in Homo sapiens. We won't try to generalize to multiple species or to detect exon ends.

Start by visiting the Ensembl database, where you can download human chromosome sequences in the FASTA format. Next, download the annotations in the GTF format. The GTF files contain far more information than we need; only the lines describing exons matter for this task. Each line contains several tab-separated values. The first value is the chromosome. The third is the feature type; exon is the type we will work with. The fourth and fifth values are the start and end positions of the exon. For example:

16	ensembl_havana	exon	164482	164686	.	+	.	gene_id "ENSG00000206178"; gene_version "2"; transcript_id "ENST00000354915"; transcript_version "3"; exon_number "2"; gene_name "HBZP1"; gene_source "ensembl_havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "HBZP1-201"; transcript_source "ensembl_havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003673193"; exon_version "1"; tag "basic"; transcript_support_level "NA";
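A minimal sketch of how such lines could be filtered, assuming plain Python and no GTF library. The names `Exon` and `iter_exons` are hypothetical, and the sample line is a shortened copy of the one above. GTF coordinates are 1-based and inclusive, and `start` is never greater than `end`, whichever strand the exon lies on.

```python
from typing import Iterable, Iterator, NamedTuple


class Exon(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def iter_exons(lines: Iterable[str]) -> Iterator[Exon]:
    """Yield one Exon per GTF line whose feature type (column 3) is 'exon'."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # malformed line or a different feature type
        yield Exon(fields[0], int(fields[3]), int(fields[4]), fields[6])


# Shortened version of the example line (attribute column truncated).
sample = (
    '16\tensembl_havana\texon\t164482\t164686\t.\t+\t.\t'
    'gene_id "ENSG00000206178"; exon_number "2";\n'
)
exons = list(iter_exons([sample]))
print(exons[0])
```

For a real file you would pass an open file handle instead of a list. For exons on the "-" strand, the first transcribed base is at the `end` coordinate; whether to treat that position as the "exon start" is a design decision worth stating in your report.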

Implement a neural network that searches for the exon starts.
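A network needs numeric input, so one possible first step is to turn the sequence around a candidate position into a one-hot matrix. The sketch below shows that step only. The assignment does not prescribe one-hot encoding or fixed-size windows, so both are assumptions, and `encode_window` and `flank` are hypothetical names.

```python
# One row per base: columns correspond to A, C, G, T.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode_window(sequence: str, position: int, flank: int = 3) -> list:
    """One-hot encode `flank` bases before a 1-based position and
    `flank` bases starting at it (2 * flank rows in total).

    Bases outside the sequence, and symbols other than A/C/G/T
    (for example N), are encoded as all zeros.
    """
    centre = position - 1  # convert to a 0-based index
    rows = []
    for i in range(centre - flank, centre + flank):
        base = sequence[i].upper() if 0 <= i < len(sequence) else "N"
        rows.append(list(ONE_HOT.get(base, [0, 0, 0, 0])))
    return rows


window = encode_window("ACGTACGT", position=5, flank=2)
print(window)
```

Windows centred on annotated exon starts could serve as positive examples and windows at other positions as negative ones. How you sample the negatives will strongly affect the accuracy you report.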

These ideas might help you with your implementation:

Finally, write a short report (no longer than two pages!). Explain how your implementation works, what you implemented, and where I can find the various components of your code. Of course, your implementation does not need to be perfect. Include anything that you (or I) might be curious about. What was the accuracy of your classifier? Does it generalize to different species? How long did it take to train the model? What obstacles did you run into during implementation?

External libraries

You may use any external libraries of your choice, except for high-level libraries designed specifically for applying neural networks to genomic sequence classification. If you are not sure whether a library is allowed, write to me in advance.